Tumor classification by combining PNN classifier ensemble with neighborhood rough set based gene reduction

doi:10.1016/j.compbiomed.2009.11.014

Computers in Biology and Medicine

Volume 40, Issue 2, February 2010, Pages 179-189

https://doi.org/10.1016/j.compbiomed.2009.11.014

Abstract

Since Golub et al. applied gene expression profiles (GEP) to the molecular classification of tumor subtypes for more accurate and reliable clinical diagnosis, a number of studies on GEP-based tumor classification have been conducted. However, the challenges posed by the high dimensionality and small sample size of tumor datasets remain. This paper presents a new tumor classification approach based on an ensemble of probabilistic neural networks (PNN) and gene reduction based on the neighborhood rough set model. Informative genes were initially selected by gene ranking based on an iterative search margin algorithm and were then further refined by gene reduction to select many minimum gene subsets. Finally, the candidate base PNN classifiers, each trained on one of the selected gene subsets, were integrated by a majority voting strategy to construct an ensemble classifier. Experiments on tumor datasets showed that this approach obtains both high and stable classification performance, is not overly sensitive to the number of initially selected genes, and is competitive with most existing methods. Additionally, the classification results can be cross-verified in a single biomedical experiment by the selected gene subsets, and biological experimental results also showed that the genes included in the selected gene subsets are functionally related to carcinogenesis, indicating that the performance obtained by the proposed method is convincing.

Introduction

Tumor is regarded as a systems biology disease [1]. The mechanism of tumor development is still not thoroughly understood. Since treatment of patients with later-stage cancers is often not therapeutically effective, medical experts agree that early diagnosis is of great benefit to successful tumor therapy. However, it is difficult for traditional tumor mass detection techniques, such as X-ray imaging, to detect tumors early. In the past 10 years, molecular diagnosis of tumors based on gene expression profiles (GEP) has attracted a great number of medical researchers and computer scientists aiming at precise and early tumor diagnosis [2], [3], [4], [5], [6]. However, the curse of dimensionality caused by the high dimensionality and small sample size of tumor datasets seriously challenges tumor classification. How to select important gene subsets from the thousands of genes in a GEP dataset, and so drastically reduce the dimensionality of the tumor dataset, is therefore the first key step in addressing this problem. Usually, the prediction performance of the selected gene subsets is evaluated by a classifier. Commonly used classifiers, including support vector machines (SVM) [7], [8], [9], [10], k-nearest neighbor (k-NN) [11], [12], C4.5 [13], artificial neural networks (ANN) [14], [15], self-organizing map (SOM) [16], self-organizing tree algorithm (SOTA) [17], and probabilistic neural networks (PNN) [18], [19], [20], have been extensively applied to the molecular classification of tumor subtypes for more accurate and reliable clinical diagnosis. From those experimental results, we can conclude that gene selection, such as selecting informative genes by using regulation probability [21] or independent component analysis [22], plays an important role in tumor classification.

Finding minimum tumor-related gene subsets can improve the predictive performance of a classification model because too many redundant or irrelevant genes might degrade the classification accuracy [23]. In addition to removing noise in GEP, the selected gene subsets also have important biomedical meaning and may be applied to the discovery of drug targets. Generally speaking, gene selection methods are categorized into two groups [24]. One is Wrapper methods, which combine gene selection with a classifier; the other is Filter methods, in which the gene selection procedure is independent of classifiers. In most cases, Wrapper methods are superior to Filter methods in improving classification accuracy [25]. However, Wrapper methods that adopt different classifiers usually obtain different optimal gene subsets, which indicates that Wrapper methods are to some extent unstable in gene selection because the obtained accuracy is sensitive to the selected gene subsets. Another drawback is their high computational cost. These are intrinsic drawbacks of most existing Wrapper methods when facing the curse of dimensionality and the variety of uncertainties in tumor datasets (the gathering process of microarray data, including fabrication, hybridization and image processing, always adds various sources of noise) [26], [27]. Moreover, when applied to these problems, traditional intelligent methods are apt to over-fit the tumor dataset because of the lack of training samples [28]. In fact, there are numerous optimal gene subsets with very high classification accuracy in a tumor dataset [29], [30], mainly because of gene co-expression and the functional similarity of many genes, so how to obtain convincing classification accuracy from these optimal gene subsets is still an important problem.

Solutions to the above problem include various ensemble schemes [31], [32], [33], [34], [35]. These studies suggested that ensemble machine learning or ensemble classifiers consistently perform better [13], in that a powerful classifier can be constructed from an ensemble of many base classifiers even though these base classifiers are individually weak in making decisions [36]. For example, Peng [27] proposed a robust ensemble approach to tumor classification by generating a pool of candidate base classifiers based on gene sub-sampling and then selecting a set of appropriate base classifiers to construct a high performance classification committee based on classifier clustering. Both theoretical and experimental studies have shown that integrating a set of diverse and accurate base classifiers leads to a powerful ensemble classifier, where the diversity of the base classifiers is a prerequisite for an ensemble that outperforms each base classifier [37], because combining a set of identical classifiers intuitively cannot generate any improvement. However, most conventional ensemble methods applied to tumor classification, such as methods based on re-sampling of samples or genes, are so random that their biological meaning is difficult to interpret. Therefore, the diversity and accuracy of base classifiers should be considered simultaneously in designing an ensemble classifier. In this study, we propose a novel ensemble method that combines base PNN classifiers with gene reduction based on the neighborhood rough set model. Experiments on three well-known tumor datasets show that the proposed methods not only achieve higher classification accuracy but are also more stable in classification performance.
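The claim that many individually weak base classifiers can form a strong ensemble can be illustrated with a small calculation. The sketch below is not taken from the paper: it assumes a two-class problem, an odd number of base classifiers, and votes that are statistically independent with the same accuracy `p`, which real base classifiers only approximate. The function name `majority_vote_accuracy` is hypothetical.

```python
from math import comb


def majority_vote_accuracy(n_classifiers, p):
    """Probability that a strict majority of n independent classifiers,
    each correct with probability p, votes for the correct class.

    Assumes a two-class problem and an odd n (so no ties can occur).
    """
    need = n_classifiers // 2 + 1  # smallest number of correct votes that wins
    return sum(
        comb(n_classifiers, k) * p ** k * (1 - p) ** (n_classifiers - k)
        for k in range(need, n_classifiers + 1)
    )


# Ensemble accuracy for 1, 3 and 25 voters that are each right 70% of the time.
for n in (1, 3, 25):
    print(n, round(majority_vote_accuracy(n, 0.7), 4))
```

Under these idealized assumptions the ensemble accuracy grows with the number of voters whenever `p` exceeds 0.5. If all voters are identical, however, the votes are perfectly correlated and the ensemble is no better than a single classifier, which is why diversity matters.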

The remainder of this paper is organized as follows. Section 2 introduces the neighborhood rough set model for gene reduction, the framework of the PNN ensemble algorithm and two gene pre-selection methods: an iterative search margin based algorithm and a weighted feature score criterion. Section 3 describes our four experimental methods and provides their experimental results on three well-known tumor datasets, together with the biomedical interpretation of some selected genes. A rough comparison with other related work is also made in this section. Finally, Section 4 presents the conclusions.

Section snippets

Neighborhood rough set model

How to generate diverse base classifiers is a critical problem in ensemble machine learning. In our ensemble method, diverse base classifiers were produced from diverse gene subsets obtained by gene reduction based on the neighborhood rough set model (NRSM) [38], [39]. The principle of NRSM is briefly introduced as follows.

Let $G = \{g_{1}, \dots, g_{n}\}$ be a set of genes and $S = \{s_{1}, \dots, s_{m}\}$ be a set of samples. The corresponding gene expression matrix can be represented as $X = (x_{i,j})_{m \times n}$, where $x_{i,j}$ is the expression
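As a hedged illustration of how a neighborhood rough set can score a gene subset, the sketch below uses the commonly stated form of the model, which is an assumption here because the definitions are not spelled out in the text above: the neighborhood of a sample is every sample within Euclidean distance `delta` on the chosen genes, and the dependency of the class label on a gene subset is the fraction of samples whose whole neighborhood shares their class. The function names, the toy matrix and the value of `delta` are all hypothetical.

```python
import math


def neighborhood(X, i, genes, delta):
    """Indices of samples within Euclidean distance delta of sample i,
    measured only on the chosen gene columns."""
    return [
        j for j in range(len(X))
        if math.sqrt(sum((X[i][g] - X[j][g]) ** 2 for g in genes)) <= delta
    ]


def dependency(X, y, genes, delta):
    """Fraction of samples whose whole delta-neighborhood shares their class
    (size of the positive region divided by the number of samples)."""
    consistent = 0
    for i in range(len(X)):
        if all(y[j] == y[i] for j in neighborhood(X, i, genes, delta)):
            consistent += 1
    return consistent / len(X)


# Toy expression matrix: 4 samples (rows) x 2 genes (columns), values in [0, 1].
X = [[0.10, 0.50],
     [0.15, 0.52],
     [0.90, 0.51],
     [0.95, 0.49]]
y = ["A", "A", "B", "B"]

print(dependency(X, y, genes=[0], delta=0.2))  # gene 0 separates the classes
print(dependency(X, y, genes=[1], delta=0.2))  # gene 1 does not
```

A gene reduction procedure built on this score would look for small gene subsets whose dependency is as high as that of the full gene set. Different search orders can yield different subsets, which is one way to obtain diverse base classifiers.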

Sample datasets

The proposed method is applied to three published tumor datasets: the leukemia dataset [16], the colon tumor dataset [50] and the small round blue cell tumor (SRBCT) dataset [51]. The leukemia and colon tumor datasets each contain only two subclasses, as shown in Table 1. We downloaded the SRBCT dataset, which contains 88 samples with 2,308 genes in each sample as shown in Table 2, from the web site http://research.nhgri.nih.gov/microarray/Supplement. According to Ref. [51], there are 63 training

Conclusions

Finding tumor-related genes is helpful for personalized medicine and earlier tumor diagnosis [35]. In this paper, we designed a new ensemble method for tumor classification. This method began with gene-ranking-based gene selection using the Simba algorithm or the WFSC criterion, then applied FARNeM-based gene reduction to obtain 100 gene subsets, with which 100 base PNN classifiers were trained, respectively. Finally, the top 25 optimal base PNN classifiers were integrated by majority voting
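The final stage of the pipeline summarized above, one PNN per gene subset combined by majority voting, can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses a basic Gaussian-kernel PNN with a hypothetical smoothing parameter `sigma`, hand-picked gene subsets in place of the Simba/WFSC ranking and FARNeM reduction, and it omits the step that keeps only the top 25 base classifiers. The toy data and class labels are invented for the example.

```python
import math
from collections import Counter


def pnn_predict(train_X, train_y, x, genes, sigma=0.1):
    """Classify x with a basic PNN restricted to the given gene columns:
    average a Gaussian kernel over each class's training samples and
    return the class with the largest average."""
    sums = {}
    counts = Counter(train_y)
    for xi, yi in zip(train_X, train_y):
        d2 = sum((x[g] - xi[g]) ** 2 for g in genes)
        sums[yi] = sums.get(yi, 0.0) + math.exp(-d2 / (2 * sigma ** 2))
    return max(sums, key=lambda c: sums[c] / counts[c])


def ensemble_predict(train_X, train_y, x, gene_subsets, sigma=0.1):
    """Majority vote over one PNN per gene subset."""
    votes = Counter(
        pnn_predict(train_X, train_y, x, subset, sigma) for subset in gene_subsets
    )
    return votes.most_common(1)[0][0]


# Toy data: 4 samples x 3 genes; gene 2 carries no class information.
train_X = [[0.1, 0.2, 0.5], [0.2, 0.1, 0.4], [0.9, 0.8, 0.5], [0.8, 0.9, 0.4]]
train_y = ["ALL", "ALL", "AML", "AML"]
gene_subsets = [[0], [1], [0, 1]]  # stand-ins for reduced gene subsets

print(ensemble_predict(train_X, train_y, [0.15, 0.25, 0.45], gene_subsets))
```

Using an odd number of gene subsets avoids tied votes in the two-class case. With more classes or an even number of voters, a tie-breaking rule would need to be chosen explicitly.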

Conflict of interest statement

None Declared.

Acknowledgments

This work was supported by the National Science Foundation of China (Grant nos. 60973153 and 30700161), the Guide Project of Innovative Base of Chinese Academy of Sciences (Grant no. KSCX1-YW-R-30), the Knowledge Innovation Program of the Chinese Academy of Sciences (0823A16121), and the China Postdoctoral Science Foundation (Grant no. 20090450707).

Shulin Wang was born in Sichuan, China. He is currently a postdoctoral researcher at the Intelligent Computing Lab, Hefei Institute of Intelligent Machines, Chinese Academy of Sciences, China. He obtained his Ph.D. degree from the National University of Defense Technology, China. He received his M.Sc. degree in Computer Application from the National University of Defense Technology, China, in 1997, and his B.Sc. degree in Computer Application from China University of Geosciences in 1989. He

References (103)

J.J. Hornberg et al., Cancer: a system biology disease, BioSystems (2006)

H.L. Huang et al., ESVM: evolutionary support vector machine for automatic feature selection and classification of microarray data, BioSystems (2007)

D. Singh et al., Gene expression correlates of clinical prostate cancer behavior, Cancer Cell (2002)

P. Antal et al., Bayesian applications of belief networks and multilayer perceptrons for ovarian tumor classification with rejection, Artificial Intelligence in Medicine (2003)

G.M. Sun et al., Tumor tissue identification based on gene expression data using DWT feature extraction and PNN classifier, Neurocomputing (2006)

H.L. Huang et al., Selecting a minimal number of relevant genes from microarray data to design accurate tissue classifiers, BioSystems (2007)

Y.H. Peng, A novel ensemble machine learning for robust microarray data classification, Computers in Biology and Medicine (2006)

P.J.S. Silva et al., Feature selection algorithms to find strong genes, Pattern Recognition Letters (2005)

F. Masulli et al., Random Voronoi ensembles for gene selection, Neurocomputing (2003)

H. Moon et al., Ensemble methods for classification of patients for personalized medicine with high-dimensional data, Artificial Intelligence in Medicine (2007)

Q.H. Hu et al., Neighborhood classifiers, Expert Systems with Applications (2008)

S.R. Vavricka et al., hPepT1 transports muramyl dipeptide, activating NF-kappaB and stimulating IL-8 secretion in human colonic Caco2/bbe cells, Gastroenterology (2004)

K.W. Suh et al., Thymidylate synthase gene polymorphism as a prognostic factor for colon cancer, Journal of Gastrointestinal Surgery (2005)

E.M. Reyes-Reyes et al., Cell-surface nucleolin is a signal transducing P-selectin binding protein for human colon carcinoma cells, Experimental Cell Research (2008)

S.J. Orr et al., CD33 responses are blocked by SOCS3 through accelerated proteasomal-mediated turnover, Blood (2007)

M.V. Shah et al., Molecular profiling of LGL leukemia reveals role of sphingolipid signaling in survival of cytotoxic lymphocytes, Blood (2008)

T. Macalma et al., Molecular characterization of human Zyxin, The Journal of Biological Chemistry (1996)

Y.Z. Wu et al., Identification of a S100 calcium-binding protein expressed in HL-60 cells treated with all-trans retinoic acid by two-dimensional electrophoresis and mass spectrometry, Leukemia Research (2004)

D. Steinbach et al., Clinical implications of PRAME gene expression in childhood, Cancer Genetics and Cytogenetics (2002)

J. Roman-Gomez et al., Promoter hypermethylation of cancer-related genes: a strong independent prognostic factor in acute lymphoblastic leukemia, Blood (2004)

C.P. Minniti et al., The insulin-like growth factor II (IGF-II)/mannose 6-phosphate receptor mediates IGF-II-induced motility in human rhabdomyosarcoma cells, Journal of Biological Chemistry (1992)

J. Hulit et al., The cyclin D1 gene is transcriptionally repressed by caveolin-1, Journal of Biological Chemistry (2000)

M. Dettling, BagBoosting for tumor classification with gene expression data, Bioinformatics (2004)

A.M. Bagirov et al., New algorithm for multi-class cancer diagnosis using tumor gene expression signatures, Bioinformatics (2003)

A.C. Tan et al., Simple decision rules for classifying human cancers from gene expression profiles, Bioinformatics (2005)

J.J. Dai et al., Dimension reduction for classification with gene expression microarray data, Statistical Applications in Genetics and Molecular Biology (2006)

M. Mramor et al., Visualization-based cancer microarray data classification analysis, Bioinformatics (2007)

I. Guyon et al., Gene selection for cancer classification using support vector machine, Machine Learning (2002)

K.B. Duan et al., Multiple SVM-RFE for gene selection in cancer classification with expression data, IEEE Transactions on NanoBioscience (2005)

A. Blanco, M. Martín-Merino, J.D.L. Rivas, Ensemble of support vector machines to improve the cancer class prediction...

L. Li et al., Gene assessment and sample classification for gene expression data using a genetic algorithm/k-nearest neighbor method, Combinatorial Chemistry & High Throughput Screening (2001)

A.C. Tan et al., Ensemble machine learning on gene expression data for cancer classification, Applied Bioinformatics (2003)

J. Ryu, S.B. Cho, Gene expression classification using optimal feature/classifier ensemble with negative correlation,...

T.R. Golub et al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science (1999)

A. Mateos, J. Herrero, J. Tamames, J. Dopazo, Supervised neural networks for clustering conditions in DNA array data...

R. Xu, D.C. Wunsch, Probabilistic neural networks for multi-class tissue discrimination with gene expression data, in:...

C.J. Huang, W.C. Liao, A comparative study of feature selection methods for probabilistic neural networks in cancer...

H.Q. Wang et al., Regulation probability method for gene selection, Pattern Recognition Letters (2006)

C.H. Zheng et al., Tumor clustering using non-negative matrix factorization with gene selection, IEEE Transactions on Information Technology in Biomedicine (2009)

G.H. John, R. Kohavi, K. Pfleger, Irrelevant features and the subset selection problem, in: Proceedings of the Eleventh...

R. Kohavi et al., Wrappers for feature subset selection, Artificial Intelligence (1997)

X. Wang et al., Quantitative quality control in microarray experiments and the application in data filtering, normalization and false positive rate prediction, Bioinformatics (2003)

S.L. Wang, H.W. Chen, S.T. Li, Gene selection using neighborhood rough set from gene expression profiles, in:...

S.L. Wang et al., Heuristic breadth-first search algorithm for informative gene selection based on gene expression profiles, Chinese Journal of Computers (2008)

H.H. Won, S.B. Cho, Neural network ensemble with negatively correlated features for cancer classification, in: Joint...

S.B. Cho et al., Cancer classification using ensemble of neural networks with multiple significant gene subsets, Applied Intelligence (2007)

O. Okun et al., Ensembles of nearest neighbors for gene expression based cancer classification, Studies in Computational Intelligence (2008)

T. Hastie et al., The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2001)

Y. Zhao, Y. Chen, X.Q. Zhang, A novel ensemble approach for cancer data classification, in: Fourth International...

Q.H. Hu et al., Numerical attribute reduction based on neighborhood granulation and rough approximation, Journal of Software (2008)


Research Interests: Bioinformatics, Software Engineering, and Complex System.
