Gene selection for cancer tumor detection using a novel memetic algorithm with a multi-view fitness function
Introduction
In recent years the rate at which biomedical data are acquired has grown explosively. Advances in molecular genetics technologies, such as DNA microarrays (Hedge, 2000), have given us a global view of the cell. The microarray is a valuable technique for measuring the expression of thousands of genes simultaneously. An important emerging medical application of microarray gene expression profiling is medical decision support, in the form of disease diagnosis and prediction of clinical outcomes in response to treatment. The two areas of medicine that currently attract the most attention in this respect are the management of cancer and of infectious diseases (Ntzani and Ioannidis, 2003). Predicting the diagnostic category of a tissue sample from its expression array phenotype, using microarray data from tissues in known categories, is known as classification. The samples are usually the experiments and the categories are the types of tissue samples. A number of systematic methods have been developed and studied to classify cancer types using gene expression data (Khan et al., 2001, Golub et al., 1999).
One of the most important challenges in microarray classification is eliminating redundant genes that do not contribute to classification accuracy. Selecting the relevant genes from all measured genes is called “gene selection”, which corresponds to feature selection in pattern classification. Many gene selection approaches have been developed over the years, ranging from statistical analysis (Khan et al., 2001) and a Bayesian model (Lee et al., 2003) to Fisher's linear discriminant analysis (Xiong et al., 2001) and support vector machines (SVM) (Guyon et al., 2002, Tang et al., 2006). Several studies (Statnikov et al., 2005, Ramaswamy et al., 2001) have compared different feature selection methods and classification systems.
When analyzing expression data by designing classifiers, it is better to provide additional biological knowledge for verifying the selected genes than to emphasize only high classification accuracy and a small number of genes. Several frequently used techniques for designing classifiers from microarray data, such as the support vector machine (Nakashima et al., 2005), neural networks (Yi-Chung, 2007), k-nearest neighbor (Wang and Lee, 2002), and the logistic regression model (Ooi and Tan, 2003), have low interpretability. This paper investigates the design of a precise and solid fuzzy rule-based classification system with linguistic interpretability that uses a small number of relevant genes. The results of this work should be valuable for microarray data analysis and for the development of economical diagnostic tests. The desired classifier is a set of linguistically interpretable fuzzy rules, each of the form: if gene A is High and gene B is Low, then the type of tumor is X.
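To make the rule form above concrete, the following is a minimal sketch of how one linguistic rule could be represented and matched against a sample. It assumes expression values normalized to [0, 1], three triangular linguistic terms (Low, Medium, High), and the product operator for combining antecedent conditions. None of these choices is stated in this excerpt, and the names `membership`, `FuzzyRule`, `gene_A`, and `gene_B` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict

# Assumption: three evenly spaced triangular terms on normalized expression [0, 1].
PEAKS = {"Low": 0.0, "Medium": 0.5, "High": 1.0}
HALF_WIDTH = 0.5


def membership(term: str, x: float) -> float:
    """Triangular membership grade of value x in the given linguistic term."""
    return max(0.0, 1.0 - abs(x - PEAKS[term]) / HALF_WIDTH)


@dataclass
class FuzzyRule:
    """Rule of the form: if gene A is High and gene B is Low, then class is X."""
    antecedent: Dict[str, str]  # gene name -> linguistic term
    consequent: str             # predicted tumor type

    def compatibility(self, sample: Dict[str, float]) -> float:
        """Product of the membership grades of all antecedent conditions."""
        grade = 1.0
        for gene, term in self.antecedent.items():
            grade *= membership(term, sample[gene])
        return grade


# Hypothetical rule and sample, for illustration only.
rule = FuzzyRule({"gene_A": "High", "gene_B": "Low"}, consequent="X")
sample = {"gene_A": 0.9, "gene_B": 0.2}
print(rule.consequent, rule.compatibility(sample))
```

A rule that mentions only a few genes stays readable, which is the interpretability argument made above: the antecedent dictionary is the explanation.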
In this study, we propose an interpretable gene expression classifier (named CD-MFS, i.e., Cancer Diagnosis with Memetic Fuzzy System) with an accurate and compact fuzzy rule base for cancer tumor detection from microarray data. The design of CD-MFS has three objectives to be optimized simultaneously: maximal classification accuracy, minimal number of rules, and minimal number of genes used. To design CD-MFS, a new memetic algorithm with a Multi-View fitness function is used to efficiently solve the design problem, which has a large number of tuning parameters. The newly presented Multi-View fitness function combines two different evaluation procedures. The first evaluates each single fuzzy if–then rule based on a specified rule quality measure, while the second evaluates each fuzzy rule according to the fitness of the resulting fuzzy rule set.
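The two evaluation views can be illustrated with a small sketch. This is not the fitness function defined in the paper, whose exact form does not appear in this excerpt. It assumes a weighted-sum scalarization of the three design objectives for the rule-set view, a simple covered-sample score for the single-rule view, and a linear blend of the two. All function names, weights, and the blending parameter `alpha` are hypothetical.

```python
def rule_set_fitness(accuracy: float, n_rules: int, n_genes: int,
                     w_acc: float = 1.0, w_rules: float = 0.01,
                     w_genes: float = 0.01) -> float:
    """Rule-set view: reward accuracy, penalize rule count and gene count.

    Assumption: a weighted sum; the weights are illustrative only.
    """
    return w_acc * accuracy - w_rules * n_rules - w_genes * n_genes


def rule_quality(n_correct: int, n_wrong: int) -> float:
    """Single-rule view: net fraction of covered samples classified correctly.

    Assumption: a rule that covers no sample gets quality 0.
    """
    covered = n_correct + n_wrong
    if covered == 0:
        return 0.0
    return (n_correct - n_wrong) / covered


def multi_view_fitness(n_correct: int, n_wrong: int, set_fitness: float,
                       alpha: float = 0.5) -> float:
    """Blend the rule's own quality with the fitness of the set it belongs to."""
    return alpha * rule_quality(n_correct, n_wrong) + (1.0 - alpha) * set_fitness


# Hypothetical numbers: a 10-rule, 20-gene rule set with 90% accuracy,
# and one rule in it that classifies 8 samples correctly and 2 wrongly.
set_score = rule_set_fitness(accuracy=0.9, n_rules=10, n_genes=20)
print(multi_view_fitness(n_correct=8, n_wrong=2, set_fitness=set_score))
```

The point of blending is that a rule which looks good in isolation can still be penalized when the rule set containing it is large or inaccurate, and vice versa.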
The rest of the paper is organized as follows. Section 2 gives a summary review of the basic conventions used in this paper. Section 3 introduces our fuzzy memetic-based rule learning algorithm. The following section discusses the experimental results we obtained. The last section draws some conclusions.
Section snippets
Biological data overview
All organisms, except viruses, are composed of cells. There are trillions of cells in the human body. Cell nuclei contain chromosomes, and chromosomes contain DNA. DNA is composed of coding and non-coding parts; a coding part is referred to as a gene. Genes encode the instructions for producing proteins. Proteins are large molecules which are the basis of any organism. All cells in one organism have the same genes, but these genes have different expressions under different conditions and at different times (KDnuggets
Related works
Computational analysis can help researchers collate a group of signature genes for a certain disease (Wang et al., 2005, Buturovic, 2006). Because microarray chips are very expensive and tissue samples from cancer patients are scarce, microarray datasets usually contain too few records to suit most machine learning algorithms. In addition, the processing and materials used for microarray analysis typically differ between
Proposed method (CD-MFS)
This section presents the proposed memetic algorithm in two subsections. Section 4.1 briefly explains fuzzy if–then rules and a fuzzy reasoning method for pattern classification problems with continuous attributes. A heuristic procedure is also described to determine the consequent class and the certainty grade of each fuzzy if–then rule from training patterns. This heuristic procedure is an adapted version of the one introduced by Ishibuchi et al. (1992). The
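For orientation, the heuristic of Ishibuchi et al. is commonly stated as follows: sum the compatibility grades of the training patterns per class, choose the class with the largest sum as the consequent, and set the certainty grade to the margin between that sum and the mean of the other class sums, divided by the total. The sketch below implements that commonly cited form. The adapted version used in the paper is not shown in this excerpt and may differ. The function name and its inputs (precomputed compatibility grades) are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def consequent_and_certainty(
    compat: Sequence[float],   # compatibility of each training pattern with the rule
    labels: Sequence[str],     # class label of each training pattern
    classes: Sequence[str],    # all class labels (at least two)
) -> Tuple[Optional[str], float]:
    """Return (consequent class, certainty grade) for one fuzzy rule.

    Returns (None, 0.0) when no class can be chosen: either no pattern is
    compatible with the rule, or several classes tie for the largest sum.
    """
    # beta[h]: summed compatibility of the patterns belonging to class h.
    beta = {c: 0.0 for c in classes}
    for mu, y in zip(compat, labels):
        beta[y] += mu

    total = sum(beta.values())
    best = max(beta.values())
    winners = [c for c, b in beta.items() if b == best]
    if total == 0.0 or len(winners) > 1:
        return None, 0.0

    # Mean summed compatibility over the classes other than the winner.
    others_mean = (total - best) / (len(classes) - 1)
    return winners[0], (best - others_mean) / total


# Hypothetical grades for three training patterns and three classes.
print(consequent_and_certainty([0.8, 0.6, 0.1], ["A", "A", "B"], ["A", "B", "C"]))
```

Under this form the certainty grade is 1 when only one class is compatible with the rule, and it shrinks as the other classes gain compatibility, so ambiguous rules contribute less during classification.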
Experimental results
We used the 14_Tumors data set to evaluate the performance of the CD-MFS classification system in cancer diagnosis. The 14_Tumors data set was gathered by Ramaswamy et al. (2001) and comprises 308 samples and 15009 genes. This data set has an important property: it covers 14 different cancer tumor types and is known as a comprehensive cancer data set. These tumor types consist of breast, prostate, lung, colorectal, lymphoma, bladder transitional cell carcinoma,
Conclusion
In this paper we presented CD-MFS, an algorithm based on the memetic evolutionary idea that classifies gene expression data with an accurate set of fuzzy if–then rules. It begins with low-quality fuzzy if–then rules and produces a high-quality rule set. This algorithm has acceptable accuracy and classifies cancerous and benign tumors efficiently. Our proposed algorithm was evaluated on the 14_Tumors cancer dataset and compared with other classification systems. Results indicate that our
Acknowledgment
This work was supported by the Iran National Science Foundation (INSF).
References (56)
Exploiting scale-free information from expression data for cancer classification. Comput. Biol. Chem. (2005)
Gene expression profile class prediction using linear Bayesian classifiers. Comput. Biol. Med. (2007)
An interpretable fuzzy rule-based classification methodology for medical diagnosis. Artif. Intell. Med. (2009)
Selecting a minimal number of relevant genes from microarray data to design accurate tissue classifiers. BioSystems (2007)
Distributed representation of fuzzy rules and its application to pattern classification. Fuzzy Sets Syst. (1992)
Compact cancer biomarkers discovery using a swarm intelligence feature selection algorithm. Comput. Biol. Chem. (2010)
The impact of parametrization in memetic evolutionary algorithms. Theor. Comput. Sci. (2009)
Evolutionary computing for knowledge discovery in medical diagnosis. Artif. Intell. Med. (2003)
Feature (gene) selection in gene expression-based tumor classification. Mol. Genet. Metab. (2001)
PCP: a program for supervised classification of gene expression profiles. Bioinformatics (2006)
Outcome signature genes in breast cancer: is there a unique set? Bioinformatics
Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science
Gene selection for cancer classification using support vector machines: an evaluation of gene selection methods for multi-class microarray data classification. Mach. Learn.
A concise guide to cDNA microarray analysis. Bio. Tech.
Large-scale clustering of cDNA-fingerprinting data. Genome Res.
Improvements to Platt's SMO algorithm for SVM classifier design. Neural Comput.