Tumor classification by combining PNN classifier ensemble with neighborhood rough set based gene reduction
Introduction
Tumors are regarded as a systems biology disease [1], and the mechanism of tumor development is still not thoroughly understood. Because treatment of late-stage cancers is often not therapeutically effective, medical experts agree that early diagnosis greatly benefits successful therapy. However, traditional tumor mass detection techniques, such as X-ray imaging, have difficulty detecting tumors early. Over the past 10 years, molecular diagnosis of tumors based on gene expression profiles (GEP) has attracted a great number of medical researchers and computer scientists who aim at precise and early tumor diagnosis [2], [3], [4], [5], [6]. However, the curse of dimensionality, caused by the high dimensionality and small sample size of tumor datasets, seriously challenges tumor classification. Selecting important gene subsets from the thousands of genes in a GEP dataset, and thereby drastically reducing its dimensionality, is therefore the first key step in addressing this problem. Usually, the prediction performance of the selected gene subsets is evaluated by a classifier. Commonly used classifiers, including support vector machines (SVM) [7], [8], [9], [10], k-nearest neighbor (k-NN) [11], [12], C4.5 [13], artificial neural networks (ANN) [14], [15], self-organizing maps (SOM) [16], the self-organizing tree algorithm (SOTA) [17], and probabilistic neural networks (PNN) [18], [19], [20], have been extensively applied to the molecular classification of tumor subtypes for more accurate and reliable clinical diagnosis. From those experimental results, we can conclude that gene selection, for example selecting informative genes by using regulation probability [21] or independent component analysis [22], plays an important role in tumor classification.
Finding minimal tumor-related gene subsets can improve the predictive performance of a classification model, because too many redundant or irrelevant genes may degrade classification accuracy [23]. Besides removing noise from GEP, the selected gene subsets also carry biomedical meaning and may be applied to the discovery of drug targets. Generally speaking, gene selection methods fall into two groups [24]: Wrapper methods, which combine gene selection with a classifier, and Filter methods, in which gene selection is independent of any classifier. In most cases, Wrapper methods are superior to Filter methods in improving classification accuracy [25]. However, Wrapper methods built on different classifiers usually obtain different optimal gene subsets, which indicates that they are unstable in gene selection to some extent, because the obtained accuracy is sensitive to the selected gene subsets. Another drawback is their high computational cost. These drawbacks are intrinsic to most existing Wrapper methods when they face the curse of dimensionality and the variety of uncertainties in tumor datasets (the process of gathering microarray data, including fabrication, hybridization and image processing, always adds various sources of noise) [26], [27]. Moreover, traditional intelligent methods are prone to over-fitting when classifying tumor datasets because training samples are scarce [28]. In fact, a tumor dataset contains numerous optimal gene subsets with very high classification accuracy [29], [30], mainly because of gene co-expression and the functional similarity of many genes, so how to obtain convincing classification accuracy from these optimal gene subsets remains an important problem.
Solutions to the above problem include various ensemble schemes [31], [32], [33], [34], [35]. These studies suggest that ensemble machine learning methods, or classifier ensembles, consistently perform better [13], in that a powerful classifier can be constructed from many base classifiers even though each base classifier is weak in making decisions [36]. For example, Peng [27] proposed a robust ensemble approach to tumor classification that generates a pool of candidate base classifiers by gene sub-sampling and then, by clustering the classifiers, selects a set of appropriate base classifiers to construct a high-performance classification committee. Both theoretical and experimental studies have shown that integrating a set of diverse and accurate base classifiers leads to a powerful ensemble classifier. Diversity among the base classifiers is a prerequisite for an ensemble that outperforms each of its members [37], because, intuitively, combining a set of identical classifiers yields no improvement. However, most conventional ensemble methods applied to tumor classification, such as those based on re-sampling of samples or genes, are so random that their biological meaning is difficult to interpret. Therefore, the diversity and accuracy of base classifiers should be considered simultaneously when designing an ensemble classifier. In this study, we propose a novel ensemble method which combines base PNN classifiers with gene reduction based on the neighborhood rough set model. Experiments on three well-known tumor datasets show that the proposed methods not only achieve higher classification accuracy but are also more stable in classification performance.
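The claim that weak base classifiers can form a strong committee, while identical ones cannot, can be illustrated with a small calculation. The sketch below is not taken from the paper: it assumes a binary task, an odd number of base classifiers, and fully independent errors, which real base classifiers only approximate. The function names are hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_classifiers: int, p: float) -> float:
    """Probability that a strict majority of n independent classifiers,
    each correct with probability p, votes for the right class.

    Assumptions (illustrative only): binary task, odd n, independent errors.
    """
    if n_classifiers % 2 == 0:
        raise ValueError("use an odd number of classifiers to avoid ties")
    k_min = n_classifiers // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_classifiers, k) * p**k * (1 - p) ** (n_classifiers - k)
        for k in range(k_min, n_classifiers + 1)
    )


def identical_vote_accuracy(n_classifiers: int, p: float) -> float:
    """If all n classifiers are identical they always agree, so the
    committee is exactly as accurate as a single member."""
    return p


if __name__ == "__main__":
    for n in (1, 3, 25):
        print(n, majority_vote_accuracy(n, 0.6), identical_vote_accuracy(n, 0.6))
```

Under these assumptions the committee accuracy grows with the number of members whenever each member is better than chance, whereas a committee of identical classifiers stays at the accuracy of one member. This is why diversity has to be engineered rather than assumed.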
The remainder of this paper is organized as follows. Section 2 introduces the neighborhood rough set model for gene reduction, the framework of the PNN ensemble algorithm, and two gene pre-selection methods: an iterative search margin based algorithm and a weighted feature score criterion. Section 3 describes our four experimental methods and reports their results on three well-known tumor datasets, together with the biomedical interpretation of some selected genes. A brief comparison with related work is also given in that section. Finally, Section 4 presents the conclusions.
Neighborhood rough set model
How to generate diverse base classifiers is a critical problem in ensemble machine learning. In our ensemble method, diverse base classifiers are produced from diverse gene subsets, which are obtained by gene reduction based on the neighborhood rough set model (NRSM) [38], [39]. The principle of NRSM is briefly introduced as follows.
Let $G=\{g_1,\ldots,g_n\}$ be a set of genes and $S=\{s_1,\ldots,s_m\}$ be a set of samples. The corresponding gene expression matrix can be represented as $X=(x_{i,j})_{m\times n}$, where $x_{i,j}$ is the expression
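To make the idea of neighborhood rough set based gene reduction concrete, the following is a minimal sketch of the commonly used formulation: a sample belongs to the positive region if every sample within distance delta of it (measured on the chosen genes) has the same class label, and the dependency of the class on a gene subset is the fraction of such samples. A greedy forward search then adds the gene that raises the dependency most. This is a simplified illustration, not the authors' implementation. The Euclidean distance, the `delta` and `eps` values, the function names, and the toy data are all assumptions.

```python
import numpy as np


def dependency(X, y, genes, delta):
    """Fraction of samples whose delta-neighborhood, computed on the
    columns listed in `genes`, contains only samples of the same class."""
    sub = X[:, genes]
    diff = sub[:, None, :] - sub[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))  # pairwise Euclidean distances
    consistent = 0
    for i in range(len(y)):
        neighbors = dist[i] <= delta  # includes the sample itself
        if np.all(y[neighbors] == y[i]):
            consistent += 1
    return consistent / len(y)


def forward_reduction(X, y, delta, eps=1e-6):
    """Greedy forward search: repeatedly add the gene that increases the
    dependency most; stop when the gain is no larger than eps."""
    selected, best = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        top_dep, top_gene = max(
            (dependency(X, y, selected + [g], delta), g) for g in remaining
        )
        if top_dep - best <= eps:
            break
        selected.append(top_gene)
        remaining.remove(top_gene)
        best = top_dep
    return selected, best


# Toy data (hypothetical): gene 0 separates the classes, gene 1 is noise,
# gene 2 is constant.
X_toy = np.array([
    [0.00, 0.5, 1.0],
    [0.10, 0.9, 1.0],
    [0.05, 0.1, 1.0],
    [1.00, 0.4, 1.0],
    [0.90, 0.8, 1.0],
    [0.95, 0.2, 1.0],
])
y_toy = np.array([0, 0, 0, 1, 1, 1])

if __name__ == "__main__":
    print(forward_reduction(X_toy, y_toy, delta=0.2))
```

Starting the search from different pre-selected gene pools, or with different parameters, yields different reduced subsets, which is one way such a reduction can supply diverse inputs for base classifiers.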
Sample datasets
The proposed method is applied to three published tumor datasets: the leukemia dataset [16], the colon tumor dataset [50] and the small round blue cell tumor (SRBCT) dataset [51]. The leukemia and colon tumor datasets each contain only two subclasses, as shown in Table 1. We downloaded the SRBCT dataset, which contains 88 samples with 2,308 genes in each sample as shown in Table 2, from the web site http://research.nhgri.nih.gov/microarray/Supplement. According to Ref. [51], there are 63 training
Conclusions
Finding tumor-related genes is helpful for personalized medicine and earlier tumor diagnosis [35]. In this paper, we designed a new ensemble method for tumor classification. The method begins with gene selection based on gene ranking, using either the Simba algorithm or the WFSC criterion, and then applies FARNeM-based gene reduction to obtain 100 gene subsets, each of which is used to train one of 100 base PNN classifiers. Finally, the top 25 optimal base PNN classifiers were integrated by majority voting
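The final stages of the pipeline summarized above (one PNN per gene subset, selection of the best base classifiers, majority voting) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the PNN is written as a plain Gaussian Parzen-window classifier, the smoothing parameter `sigma` is a hypothetical value, and the scores passed to `select_top` are assumed to be supplied by the caller because the criterion used to rank base classifiers is not given in this excerpt.

```python
from collections import Counter

import numpy as np


def pnn_predict(X_train, y_train, X_test, sigma=1.0):
    """Assign each test sample to the class whose training samples give the
    largest average Gaussian kernel response (a basic PNN decision rule)."""
    classes = np.unique(y_train)
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    kernel = np.exp(-d2 / (2.0 * sigma ** 2))  # shape (n_test, n_train)
    scores = np.stack(
        [kernel[:, y_train == c].mean(axis=1) for c in classes], axis=1
    )
    return classes[scores.argmax(axis=1)]


def select_top(gene_subsets, scores, k):
    """Keep the k gene subsets with the highest caller-supplied scores."""
    order = np.argsort(scores)[::-1][:k]
    return [gene_subsets[i] for i in order]


def majority_vote(predictions):
    """predictions: (n_classifiers, n_samples) labels; returns one label
    per sample, the most frequent vote in that column."""
    predictions = np.asarray(predictions)
    return np.array([Counter(col).most_common(1)[0][0] for col in predictions.T])


def ensemble_predict(X_train, y_train, X_test, gene_subsets, sigma=1.0):
    """Train-free PNN committee: one base prediction per gene subset,
    combined by majority voting."""
    votes = [
        pnn_predict(X_train[:, g], y_train, X_test[:, g], sigma)
        for g in gene_subsets
    ]
    return majority_vote(votes)


# Toy data (hypothetical): genes 0 and 1 are informative, gene 2 is not.
X_train = np.array([[0.0, 0.0, 5.0], [0.2, 0.1, 1.0],
                    [1.0, 1.0, 5.0], [0.9, 1.1, 1.0]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[0.1, 0.0, 3.0], [1.0, 0.9, 3.0]])
subsets = [[0], [1], [0, 1]]

if __name__ == "__main__":
    print(ensemble_predict(X_train, y_train, X_test, subsets, sigma=0.5))
```

In the setting described above, `gene_subsets` would hold the 100 reduced subsets and `select_top(..., k=25)` would be applied before voting. An odd committee size avoids tied votes in two-class problems.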
Conflict of interest statement
None Declared.
Acknowledgments
This work was supported by the National Science Foundation of China (Grant nos. 60973153 and 30700161), the Guide Project of Innovative Base of Chinese Academy of Sciences (Grant no. KSCX1-YW-R-30), the Knowledge Innovation Program of the Chinese Academy of Sciences (0823A16121), and the China Postdoctoral Science Foundation (Grant no. 20090450707).
References (103)
- et al. Cancer: a system biology disease. BioSystems (2006)
- et al. ESVM: evolutionary support vector machine for automatic feature selection and classification of microarray data. BioSystems (2007)
- et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell (2002)
- et al. Bayesian applications of belief networks and multilayer perceptrons for ovarian tumor classification with rejection. Artificial Intelligence in Medicine (2003)
- et al. Tumor tissue identification based on gene expression data using DWT feature extraction and PNN classifier. Neurocomputing (2006)
- et al. Selecting a minimal number of relevant genes from microarray data to design accurate tissue classifiers. BioSystems (2007)
- A novel ensemble machine learning for robust microarray data classification. Computers in Biology and Medicine (2006)
- et al. Feature selection algorithms to find strong genes. Pattern Recognition Letters (2005)
- et al. Random Voronoi ensembles for gene selection. Neurocomputing (2003)
- et al. Ensemble methods for classification of patients for personalized medicine with high-dimensional data. Artificial Intelligence in Medicine (2007)
- Neighborhood classifiers. Expert Systems with Applications
- hPepT1 transports muramyl dipeptide, activating NF-kappaB and stimulating IL-8 secretion in human colonic Caco2/bbe cells. Gastroenterology
- Thymidylate synthase gene polymorphism as a prognostic factor for colon cancer. Journal of Gastrointestinal Surgery
- Cell-surface nucleolin is a signal transducing P-selectin binding protein for human colon carcinoma cells. Experimental Cell Research
- CD33 responses are blocked by SOCS3 through accelerated proteasomal-mediated turnover. Blood
- Molecular profiling of LGL leukemia reveals role of sphingolipid signaling in survival of cytotoxic lymphocytes. Blood
- Molecular characterization of human Zyxin. The Journal of Biological Chemistry
- Identification of a S100 calcium-binding protein expressed in HL-60 cells treated with all-trans retinoic acid by two-dimensional electrophoresis and mass spectrometry. Leukemia Research
- Clinical implications of PRAME gene expression in childhood. Cancer Genet Cytogenet
- Promoter hypermethylation of cancer-related genes: a strong independent prognostic factor in acute lymphoblastic leukemia. Blood
- The insulin-like growth factor II (IGF-II)/mannose 6-phosphate receptor mediates IGF-II-induced motility in human rhabdomyosarcoma cells. Journal of Biological Chemistry
- The cyclin D1 gene is transcriptionally repressed by caveolin-1. Journal of Biological Chemistry
- BagBoosting for tumor classification with gene expression data. Bioinformatics
- New algorithm for multi-class cancer diagnosis using tumor gene expression signatures. Bioinformatics
- Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics
- Dimension reduction for classification with gene expression microarray data. Statistical Applications in Genetics and Molecular Biology
- Visualization-based cancer microarray data classification analysis. Bioinformatics
- Gene selection for cancer classification using support vector machine. Machine Learning
- Multiple SVM-RFE for gene selection in cancer classification with expression data. IEEE Transactions on NanoBioscience
- Gene assessment and sample classification for gene expression data using a genetic algorithm/k-nearest neighbor method. Combinatorial Chemistry & High Throughput Screening
- Ensemble machine learning on gene expression data for cancer classification. Applied Bioinformatics
- Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science
- Regulation probability method for gene selection. Pattern Recognition Letters
- Tumor clustering using non-negative matrix factorization with gene selection. IEEE Transactions on Information Technology in Biomedicine
- Wrappers for feature subset selection. Artificial Intelligence
- Quantitative quality control in microarray experiments and the application in data filtering, normalization and false positive rate prediction. Bioinformatics
- Heuristic breadth-first search algorithm for informative gene selection based on gene expression profiles. Chinese Journal of Computers
- Cancer classification using ensemble of neural networks with multiple significant gene subsets. Applied Intelligence
- Ensembles of nearest neighbors for gene expression based cancer classification. Studies in Computational Intelligence
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction
- Numerical attribute reduction based on neighborhood granulation and rough approximation. Journal of Software
Shulin Wang was born in Sichuan, China. He is currently a postdoctoral researcher at the Intelligent Computing Lab, Hefei Institute of Intelligent Machines, Chinese Academy of Sciences, China. He obtained his Ph.D. degree from the National University of Defense Technology, China. He received his M.Sc. degree in Computer Application from the National University of Defense Technology, China, in 1997, and his B.Sc. degree in Computer Application from China University of Geosciences in 1989. He also worked at Hunan University from 2000 to 2007.
Research interests: Bioinformatics, Software Engineering, and Complex Systems.