Gene selection for cancer tumor detection using a novel memetic algorithm with a multi-view fitness function

https://doi.org/10.1016/j.engappai.2012.12.009

Abstract

Cancer is one of the key research topics in the medical field. Accurate detection of different cancer tumor types has great value in providing better treatment and minimizing risk for patients. Recently, DNA microarray-based gene expression profiles have been employed to correlate the clinical behavior of cancers with the differential gene expression levels in cancerous and benign tumors. An accurate classifier with linguistic interpretability that uses a small number of relevant genes is beneficial to microarray data analysis and to the development of inexpensive diagnostic tests. Several well-known and frequently used techniques for designing classifiers from microarray data, such as support vector machines, neural networks, k-nearest neighbor, and logistic regression models, suffer from low comprehensibility. This paper proposes a new memetic algorithm that is capable of extracting interpretable and accurate fuzzy if–then rules from cancer data. It is the first proposal of a memetic algorithm with a Multi-View fitness function approach. The newly presented Multi-View fitness function uses two kinds of evaluation procedures. The first procedure, which is located in the main evolutionary structure of the algorithm, evaluates each single fuzzy if–then rule according to the specified rule quality, without considering the other rules. The second procedure determines the quality of each fuzzy rule according to the performance of the whole fuzzy rule set. In comparison to classic memetic algorithms, memetic algorithms of this kind enhance the rule discovery process significantly.

Introduction

In recent years there has been an explosion in the rate at which biomedical data are acquired. Advances in molecular genetics technologies, such as DNA microarrays (Hedge, 2000), have given us a global view of the cell. The microarray is a valuable technique for measuring the expression of thousands of genes simultaneously. An important emerging application domain for microarray gene expression profiling is medical decision support, in the form of disease diagnosis as well as prediction of clinical outcomes in response to treatment. The two areas in medicine that currently attract the most attention in that respect are the management of cancer and of infectious diseases (Ntzani and Ioannidis, 2003). Predicting the diagnostic category of a tissue sample from its expression array phenotype, using microarray data from tissues in identified categories, is known as classification. The samples are usually the experiments and the categories are the types of tissue samples. A number of systematic methods have been developed and studied to classify cancer types using gene expression data (Khan et al., 2001, Golub et al., 1999).

One of the most important challenges of the microarray classification problem is to eliminate redundant genes that do not contribute to classification accuracy. Selecting the relevant genes from all tested genes is called “Gene Selection”, which corresponds to feature selection in pattern classification. Many gene selection approaches, ranging from statistical analysis (Khan et al., 2001) and a Bayesian model (Lee et al., 2003) to Fisher's linear discriminant analysis (Xiong et al., 2001) and support vector machines (SVM) (Guyon et al., 2002, Tang et al., 2006), have been developed over the years. Several studies (Statnikov et al., 2005, Ramaswamy et al., 2001) have compared different feature selection methods and classification systems.
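To make the idea of gene selection as feature selection concrete, the following minimal sketch ranks genes with a simple filter score: the ratio of between-class to within-class variance of each gene's expression values. This is a generic illustration and not the selection mechanism of the method proposed in this paper; the function names `fisher_score` and `rank_genes` and the toy data are hypothetical.

```python
from statistics import mean, pvariance


def fisher_score(values, labels):
    """Between-class over within-class variance for one gene (illustrative filter score)."""
    overall = mean(values)
    between = 0.0
    within = 0.0
    for cls in set(labels):
        group = [v for v, y in zip(values, labels) if y == cls]
        between += len(group) * (mean(group) - overall) ** 2
        within += len(group) * pvariance(group)
    if within == 0.0:
        # Perfectly tight classes: infinite score only if the class means differ.
        return float("inf") if between > 0.0 else 0.0
    return between / within


def rank_genes(expression, labels, top_k):
    """Return the indices of the top_k highest-scoring genes.

    expression: list of samples, each a list of expression values (one per gene).
    """
    n_genes = len(expression[0])
    scored = []
    for g in range(n_genes):
        column = [sample[g] for sample in expression]
        scored.append((fisher_score(column, labels), g))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [g for _, g in scored[:top_k]]


# Toy data (made up): 4 samples, 3 genes, 2 tumor classes.
expression = [
    [0.1, 5.0, 2.0],
    [0.2, 1.0, 2.1],
    [0.9, 4.0, 1.9],
    [1.0, 2.0, 2.0],
]
labels = ["A", "A", "B", "B"]
selected = rank_genes(expression, labels, top_k=1)
```

In this toy example only the first gene separates the two classes cleanly, so it is the one retained. A filter of this kind scores each gene independently and therefore cannot detect redundancy between genes.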

When analyzing expression data by designing classifiers, it is better to provide associated biological knowledge for verifying the selected genes than to emphasize only high classification accuracy and a small number of genes used. Several frequently used techniques for designing classifiers from microarray data, such as the support vector machine (Nakashima et al., 2005), neural networks (Yi-Chung, 2007), k-nearest neighbor (Wang and Lee, 2002), and the logistic regression model (Ooi and Tan, 2003), have low interpretability. This paper investigates the design of a precise and robust fuzzy rule-based classification system with linguistic interpretability that uses a small number of relevant genes. The results of this work would be valuable for microarray data analysis and for the development of economical diagnostic tests. The desired classifier is a set of fuzzy rules with linguistic interpretability, where each rule has a form similar to: if gene A is High and gene B is Low, then the type of tumor is X.
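A rule of the form "if gene A is High and gene B is Low, then the type of tumor is X" can be sketched as follows. The triangular membership functions, the three linguistic terms, the normalization of expression values to [0, 1], and the product operator for combining antecedents are all assumptions made for illustration; the names `triangular`, `TERMS`, and `compatibility` are hypothetical.

```python
def triangular(x, left, peak, right):
    """Triangular membership grade of x, in [0, 1]."""
    if x == peak:
        return 1.0
    if x <= left or x >= right:
        return 0.0
    if x < peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical linguistic terms over expression values normalized to [0, 1].
TERMS = {
    "Low": (-0.5, 0.0, 0.5),
    "Medium": (0.0, 0.5, 1.0),
    "High": (0.5, 1.0, 1.5),
}


def compatibility(rule, sample):
    """Product of the membership grades of all antecedent conditions (assumed operator)."""
    grade = 1.0
    for gene, term in rule["if"].items():
        grade *= triangular(sample[gene], *TERMS[term])
    return grade


# "If gene A is High and gene B is Low, then the type of tumor is X."
rule = {"if": {"gene_A": "High", "gene_B": "Low"}, "then": "X"}
sample = {"gene_A": 0.9, "gene_B": 0.2}
grade = compatibility(rule, sample)  # how strongly this sample matches the rule
```

Because each condition names a gene and a linguistic term, the rule can be read directly by a domain expert, which is the interpretability argument made above.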

In this study, we propose an interpretable gene expression classifier (named CD-MFS, i.e., Cancer Diagnosis with Memetic Fuzzy System) with an accurate and compact fuzzy rule base for cancer tumor detection from microarray data. The design of CD-MFS has three objectives to be optimized simultaneously: maximal classification accuracy, a minimal number of rules, and a minimal number of genes used. In designing CD-MFS, a new memetic algorithm with a Multi-View fitness function approach is used to efficiently solve the design problem with a large number of tuning parameters. The newly presented Multi-View fitness function uses two different evaluation procedures. The first procedure evaluates each single fuzzy if–then rule based on the specified rule quality, while the second procedure assesses the quality of each fuzzy rule according to the fitness of the resulting fuzzy rule set.
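The difference between the two evaluation procedures can be illustrated with a minimal sketch. The concrete measures used here are assumptions chosen for clarity, not the formulas of CD-MFS: the rule-level view counts correctly minus incorrectly covered samples, and the set-level view measures how much the accuracy of the whole rule set drops when the rule is removed. Single-winner classification and the names `classify`, `rule_view`, and `set_view` are likewise hypothetical.

```python
def classify(rules, x):
    """Single-winner reasoning (assumed): the most compatible rule decides the class."""
    best = max(rules, key=lambda r: r["match"](x), default=None)
    if best is None or best["match"](x) == 0.0:
        return None  # no rule fires for this sample
    return best["label"]


def accuracy(rules, data):
    """Fraction of samples in data that the rule set classifies correctly."""
    return sum(classify(rules, x) == y for x, y in data) / len(data)


def rule_view(rule, data):
    """View 1: quality of a single rule, ignoring all other rules."""
    covered = [(x, y) for x, y in data if rule["match"](x) > 0.0]
    correct = sum(y == rule["label"] for _, y in covered)
    return correct - (len(covered) - correct)


def set_view(rule, rules, data):
    """View 2: drop in rule-set accuracy when this rule is removed."""
    others = [r for r in rules if r is not rule]
    return accuracy(rules, data) - accuracy(others, data)


# Toy rules over one normalized expression value (made-up compatibility functions).
r_high = {"match": lambda x: max(0.0, 2 * x - 1), "label": "X"}
r_low = {"match": lambda x: max(0.0, 1 - 2 * x), "label": "Y"}
r_dup = {"match": lambda x: max(0.0, x - 0.5), "label": "X"}  # overlaps with r_high
rules = [r_high, r_low, r_dup]
data = [(0.9, "X"), (0.8, "X"), (0.1, "Y"), (0.2, "Y")]

single_scores = [rule_view(r, data) for r in rules]
set_scores = [set_view(r, rules, data) for r in rules]
```

In this toy case all three rules look equally good when judged alone, but the set-level view shows that only `r_low` is indispensable, since `r_high` and `r_dup` cover the same samples. This is the kind of information that a per-rule evaluation alone cannot provide.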

The rest of the paper is organized as follows: Section 2 presents a summary review of the basic conventions used in this paper. Section 3 introduces our fuzzy memetic-based rule learning algorithm. The following section discusses the experimental results we obtained. The last section of the paper draws some conclusions.

Section snippets

Biological data overview

All organisms, except viruses, are composed of cells. There are trillions of cells in the human body. Cell nuclei contain chromosomes, and chromosomes contain DNA. DNA is composed of coding and non-coding parts; the coding part is referred to as a gene. Genes are codes that produce proteins. Proteins are large molecules which are the basis of any organism. All cells in one organism have the same genes, but these genes have different expressions under different conditions and at different times (KDnuggets

Related works

Computational analysis can help researchers to collate a group of signature genes for a certain disease (Wang et al., 2005, Buturovic, 2006). Because microarray chips are very expensive and tissue samples from cancer patients are scarce, microarray datasets usually contain too few records for most machine learning algorithms. In addition, the processing and material used for microarray analysis are typically different between

Proposed method (CD-MFS)

This section presents the proposed memetic algorithm in two subsections. Section 4.1 briefly explains fuzzy if–then rules and a fuzzy reasoning method for pattern classification problems with continuous attributes. A heuristic procedure is also described to determine the consequent class and the certainty grade of each fuzzy if–then rule from training patterns. This heuristic procedure is an adapted version of the one introduced by Ishibuchi et al. (1992). The

Experimental results

We used the 14_Tumors data set to evaluate the performance of the CD-MFS classification system in cancer diagnosis. The 14_Tumors dataset was gathered by Ramaswamy et al. (2001) and involves 308 samples and 15009 genes. This data set has an important property: it covers 14 different tumor types and is known as a comprehensive cancer dataset. These tumor types consist of breast, prostate, lung, colorectal, lymphoma, bladder transitional cell carcinoma,

Conclusion

In this paper we presented CD-MFS, an algorithm based on the memetic evolutionary idea that can classify gene expression data with an accurate set of fuzzy if–then rules. It begins with low-quality fuzzy if–then rules and results in a high-quality rule set. This algorithm has acceptable accuracy and classifies cancerous and benign tumors efficiently. Our proposed algorithm was evaluated on the 14_Tumors cancer dataset and compared with other classification systems. Results indicate that our

Acknowledgment

This work was supported by the Iran National Science Foundation (INSF).

References (56)

  • Chuang, L.Y., Wu, K.C., Yang, C.H., 2009. A hybrid feature selection method using gene expression data. In: Proceedings...
  • Dudoit, S., Fridlyand, J., Speed, T., 2000. Comparison of discrimination methods for the classification of tumours...
  • Dong, L., Frank, E., Kramer, S., 2005. Ensembles of Balanced Nested Dichotomies for Multi-Class Problems. In: PKDD...
  • Ein-Dor, L., 2005. Outcome signature genes in breast cancer: is there a unique set? Bioinformatics.
  • Fisher, R.A., 1932. Statistical methods for research...
  • Frank, E., Hall, M., 2001. A simple approach to ordinal classification. In: Proceedings of the 12th European Conference...
  • Frank, E., Hall, M., Pfahringer, B., 2003. Locally weighted naive Bayes. In: Proceedings of the 19th Conference in...
  • Golub, T.R., et al., 1999. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science.
  • Guyon, I., et al., 2002. Gene selection for cancer classification using support vector machines: an evaluation of gene selection methods for multi-class microarray data classification. Mach. Learn.
  • Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H., 2009. The WEKA Data Mining Software: an...
  • Hall, M., Frank, E., 2008. Combining naive Bayes and decision tables. In: Proceedings of the 21st Florida Artificial...
  • Hedge, P., 2000. A concise guide to cDNA microarray analysis. Bio. Tech.
  • Herwig, R., 1999. Large-scale clustering of cDNA-fingerprinting data. Genome Res.
  • Holmes, G., Pfahringer, B., Kirkby, R., Frank, E., Hall, M., 2001. Multiclass alternating decision trees. In: ECML...
  • John, G.H., Langley, P., 1995. Estimating continuous distributions in Bayesian classifiers. In: Proceedings of the...
  • Kaburlasos, V.G., Athanasiadis, I.N., Mitkas, P.A., Petridis, V., 2003. Fuzzy Lattice Reasoning (FLR) Classifier and...
  • KDnuggets, G.P., Lowell, U.M., Tamayo, P., 2004. Microarray Data Mining: Facing the Challenges. MIT/Broad Institute...
  • Keerthi, S.S., et al., 2001. Improvements to Platt's SMO algorithm for SVM classifier design. Neural Comput.