Genetic programming for anticancer therapeutic response prediction using the NCI-60 dataset
Introduction
We investigate the usefulness of genetic programming (GP) [23], [43] for understanding the functional relationship between gene expressions and therapeutic response to four clinical agents: Fluorouracil (5-FU), Floxuridine, Fludarabine and Cytarabine. We use the NCI-60 microarray dataset [8], [11], [38], a panel of 60 cell lines derived from several different cancer types, including leukemias, melanomas, ovarian, renal, prostate, colon, lung and CNS cancers.
GP is an evolutionary approach which extends the genetic model of learning to the space of programs. It is a major variation of genetic algorithms (GAs) [18], [15] in which the evolving individuals are themselves computer programs instead of fixed-length strings from a limited alphabet of symbols. In the last few years, GP has become popular for biomedical applications. In particular, GP has recently been used to mine large datasets with the goal of correlating the behavior of latent features with parameters of interest bound to drug activity patterns. For instance, in [44] GP has been used to classify drug-like molecules in terms of their bioavailability. In [1] it has been used for quantitative prediction of drug-induced toxicity. In [48] GP has been applied to cancer expression profiling data to select feature genes and build molecular classifiers by mathematical integration of these genes. In [32] the usefulness of GP for attribute selection and classification in human genetics has been discussed. GP can be regarded as an optimization method that makes no assumptions about the objective function or the data. Furthermore, as pointed out in [1], GP often performs feature selection automatically, proposing solutions that use only subsets of the input features. Thus, the motivation behind our choice of GP is twofold:
- Biological data, such as gene expression levels, are not independent of each other. Rather, small subsets of genes and molecules work in cohesion [6]. This leads to strong dependencies among the features. Hence, the underlying algorithm should be capable of extracting features from high-dimensional correlated data.
- The dimensionality of the feature space in biomedical datasets is normally much higher than the number of observations available for training. Hence, automatic feature selection, as well as other methods that handle overfitting and minimize the generalization error, should be encouraged.
Section 2 introduces the problem of relating microarray data to therapeutic response and discusses previous and related work. Section 3 presents GP and its use in this work (a more detailed discussion of how GP works can be found in Appendix A). In Section 4 the machine learning methods used for comparison with GP are discussed. In Section 5 we describe the method employed to build the dataset used in our experiments. In Section 6 we discuss experimental results. Finally, Section 7 concludes the paper and offers hints for future research.
Section snippets
DNA microarrays
DNA microarrays have dramatically accelerated many types of investigations in many fields of medicine, bioinformatics and systems biology [29], [2]. The advances in microarray technology and the growing availability of biological measurements performed at the molecular level have intensified the role of machine learning methods for effective cancer prediction and classification. These measurements are represented by the expression levels of thousands of genes exhibited in different kinds of
Genetic programming
GP [23], [43] is an evolutionary approach which extends GAs [18], [15] to the space of programs. Like any other evolutionary algorithm, GP works by defining a goal in the form of a quality criterion (or fitness) and then using this criterion to evolve a set (also called population) of solution candidates (also called individuals) by mimicking the basic principles of Darwin's theory of evolution [4]. The most common version of GP, and also the one used here, considers individuals as LISP-like tree
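As an illustration of the evolutionary loop described above, the following is a minimal sketch of tree-based GP for regression. It is not the configuration used in this work: the primitive set (add, sub, mul), the RMSE fitness, tournament selection of size 3, the mutation rate, elitism and the node limit are all assumptions chosen for brevity, and every function name is hypothetical.

```python
import math
import operator
import random

# Hypothetical primitive set: the functions and terminals used in the paper are not stated here.
FUNCTIONS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def random_tree(n_vars, depth, rng):
    """Grow a random tree: ('x', i), ('const', c) or (function_name, left, right)."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.7:
            return ("x", rng.randrange(n_vars))
        return ("const", rng.uniform(-1.0, 1.0))
    name = rng.choice(sorted(FUNCTIONS))
    return (name, random_tree(n_vars, depth - 1, rng), random_tree(n_vars, depth - 1, rng))

def evaluate(tree, row):
    """Evaluate an expression tree on one sample (a list of feature values)."""
    if tree[0] == "x":
        return row[tree[1]]
    if tree[0] == "const":
        return tree[1]
    return FUNCTIONS[tree[0]](evaluate(tree[1], row), evaluate(tree[2], row))

def rmse(tree, X, y):
    """Fitness: root mean squared error on (X, y); lower is better."""
    total = 0.0
    for row, target in zip(X, y):
        diff = evaluate(tree, row) - target
        total += diff * diff
    mse = total / len(y)
    return math.sqrt(mse) if math.isfinite(mse) else float("inf")

def subtrees(tree, path=()):
    """Yield (path, subtree) for every node; a path is a tuple of child indices."""
    yield path, tree
    if tree[0] in FUNCTIONS:
        yield from subtrees(tree[1], path + (1,))
        yield from subtrees(tree[2], path + (2,))

def replace_at(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return new
    parts = list(tree)
    parts[path[0]] = replace_at(tree[path[0]], path[1:], new)
    return tuple(parts)

def crossover(a, b, rng):
    """Subtree crossover: graft a random subtree of b onto a random node of a."""
    path, _ = rng.choice(list(subtrees(a)))
    _, donor = rng.choice(list(subtrees(b)))
    return replace_at(a, path, donor)

def evolve(X, y, pop_size=60, generations=30, max_nodes=60, seed=0):
    """Evolve a population of trees and return the fittest individual found."""
    rng = random.Random(seed)
    n_vars = len(X[0])
    fitness = lambda t: rmse(t, X, y)
    pop = [random_tree(n_vars, 3, rng) for _ in range(pop_size)]
    best = min(pop, key=fitness)
    for _ in range(generations):
        new_pop = [best]  # elitism: the best individual always survives
        while len(new_pop) < pop_size:
            parent = min(rng.sample(pop, 3), key=fitness)  # tournament selection
            mate = min(rng.sample(pop, 3), key=fitness)
            child = crossover(parent, mate, rng)
            if rng.random() < 0.2:  # mutation: graft part of a fresh random tree
                child = crossover(child, random_tree(n_vars, 2, rng), rng)
            if sum(1 for _ in subtrees(child)) > max_nodes:
                child = parent  # crude bloat control (an assumption of this sketch)
            new_pop.append(child)
        pop = new_pop
        best = min(pop, key=fitness)
    return best
```

Calling `evolve(X, y)` with `X` a list of samples and `y` the matching responses returns a nested tuple such as `("add", ("x", 0), ("const", 0.5))`. Only the variables that appear in the returned tree are used by the model, which is the implicit feature selection mentioned in the Introduction.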
Non-evolutionary regression methods used
In order to compare the results returned by GP, we have also used linear regression and least squares regression on the same datasets. These methods are described here only briefly, since they are well-known and well-established regression techniques. Furthermore, they have also been used after a preprocessing phase in which two well-known feature selection algorithms have been employed. They are briefly described in the next paragraph. For more details on these methods and algorithms and their
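For readers who want a concrete picture of such a baseline, the sketch below fits an ordinary least squares model by solving the normal equations with Gauss-Jordan elimination. It is an illustration only, not the implementation used in this work, and it assumes that feature selection has already reduced the number of features below the number of samples; otherwise the system is singular. The names `least_squares` and `predict` are hypothetical.

```python
def least_squares(X, y):
    """Fit y ~ w[0] + w[1]*x1 + ... by solving the normal equations (A^T A) w = A^T y."""
    A = [[1.0] + list(row) for row in X]  # prepend a constant column for the intercept
    p = len(A[0])
    # Augmented matrix [A^T A | A^T y]
    M = [[sum(r[i] * r[j] for r in A) for j in range(p)]
         + [sum(r[i] * t for r, t in zip(A, y))]
         for i in range(p)]
    for col in range(p):  # Gauss-Jordan elimination with partial pivoting
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: too few samples or collinear features")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(p):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [a - factor * b for a, b in zip(M[r], M[col])]
    return [M[i][p] / M[i][i] for i in range(p)]

def predict(w, row):
    """Apply the fitted coefficients to one sample."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], row))
```

The explicit singularity check makes the dimensionality problem visible: with thousands of genes and 60 cell lines, the unreduced system cannot be solved this way, which is why a feature selection step precedes the regression.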
Dataset
The NCI-60 dataset [8], [11], [38] consists of 60 human cancer cell lines from nine different kinds of cancers: colorectal, renal, ovarian, breast, prostate, lung, central nervous system, leukemia and melanoma. The gene expression profile was measured for 9703 genes but only 1375, which show strong variation among the cell lines, are retained for analysis. Expression data for genes are stored in an matrix , where is the number of samples. Each element of the matrix and
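The text does not say how "strong variation" was measured when the 1375 genes were retained. Purely as an illustration, the sketch below ranks genes by their population variance across cell lines and keeps the top k. Both the variance criterion and the layout (one row per cell line, one column per gene) are assumptions, and `top_variance_genes` is a hypothetical name.

```python
from statistics import pvariance

def top_variance_genes(expr, k):
    """Return the sorted column indices of the k genes with the highest variance.

    expr is a list of samples (cell lines), each a list of gene expression values.
    """
    n_genes = len(expr[0])
    variances = [pvariance([sample[g] for sample in expr]) for g in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: variances[g], reverse=True)
    return sorted(ranked[:k])

def keep_columns(expr, columns):
    """Project every sample onto the selected gene columns."""
    return [[sample[g] for g in columns] for sample in expr]
```

Under these assumptions, reducing a 60 x 9703 matrix would amount to `keep_columns(expr, top_variance_genes(expr, 1375))`.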
Experimental results
Each one of our four datasets (see Section 5) can be represented by matrices where and . Each line represents a gene expression whose known value of the therapeutic response to the chosen drug (Fluorouracil, Fludarabine, Floxuridine and Cytarabine, respectively) has been placed at position . Thus, the last column of matrix contains the known values of the parameter to estimate. The four matrices representing the dataset of each drug differ only in the last
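The layout described above, with the known response in the last column, can be handled as in the following sketch. Splitting off the last column follows the text; the random hold-out split, its 30% test fraction and the function names are assumptions added for illustration and are not taken from the experimental protocol.

```python
import random

def split_features_target(matrix):
    """Separate the feature columns from the last column, which holds the response."""
    X = [row[:-1] for row in matrix]
    y = [row[-1] for row in matrix]
    return X, y

def holdout_split(matrix, test_fraction=0.3, seed=0):
    """Shuffle the rows and hold out a fraction for testing (assumed protocol)."""
    rows = list(matrix)
    random.Random(seed).shuffle(rows)
    n_test = max(1, round(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]
```

A model is then fitted on the features and targets of the training rows and its error is measured on the held-out rows, which gives an estimate of generalization ability rather than of fit to the training data.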
Conclusions and future work
A genetic programming (GP)-based framework for predicting patients' anticancer therapeutic response, based on the NCI-60 microarray dataset [8], [11], [38], has been presented. We have investigated the relationship between patients' gene expressions and their response to the oncology drugs Fluorouracil, Fludarabine, Floxuridine and Cytarabine. GP has been shown to be an effective technique from the point of view of the accuracy of the solutions proposed, of the generalization capabilities and of the
Acknowledgments
A preliminary study about anticancer therapeutic responses using the NCI-60 dataset has already been presented at the SysBioHealth International Symposium held at the University of Milano-Bicocca in October 2007 [14]. We gratefully acknowledge the reviewers for their excellent remarks that helped us improve the quality of our contribution and the Symposium's organizers for honoring Dr. Ilaria Giordani with the Young Investigator Award.
References (48)
- et al.
Shortcuts in genome-scale cancer pharmacology research from multivariate analysis of the National Cancer Institute gene expression database
Biochemical pharmacology
(2001)
- et al.
Extracting gene regulation information for cancer classification
Pattern Recognition
(2007)
- et al.
Genetic programming for computational pharmacokinetics in drug discovery and development
Genetic Programming and Evolvable Machines
(2007)
- et al.
One-stop shop for microarray data
Nature
(2000)
- et al.
Frequent subtree mining—an overview
Fundamenta Informaticae
(2005)
On the origin of species by means of natural selection
(1859)
- Das R, Mitra S, Banka H, Mukhopadhyay S. Evolutionary biclustering with correlation for gene interaction networks. In:...
- Dasgupta N, Lin SM, Carin L. Modeling pharmacogenomics of the NCI-60 anticancer data set: utilizing kernel PLS to...
Gene expression signature in advanced colorectal cancer patients select drugs and response for the use of Leucovorin, Fluorouracil, and Irinotecan
Journal of Clinical Oncology
(2007)
Systematic variation in gene expression patterns in human cancer cell lines
Nature Genetics
(2000)
Genome-wide cDNA microarray screening to correlate gene expression profiles with sensitivity of 85 human cancer xenografts to anticancer drugs
Cancer Research
Comparative study of gene expression by cDNA microarray in human colorectal cancer tissues and normal mucosa
International Journal of Oncology
A gene expression database for the molecular pharmacology of cancer
Nature Genetics
Predicting cancer drug response by proteomic profiling
Clinical Cancer Research
Genetic algorithms in search, optimization and machine learning
Molecular classification of cancer: class discovery and class prediction by gene expression monitoring
Science
Information theory and an extension of maximum likelihood principle
Adaptation in natural and artificial systems
Feature selection and classification for microarray data analysis: evolutionary methods for identifying predictive genes
BMC Bioinformatics
Scaled symbolic regression
Genetic Programming and Evolvable Machines