Elsevier

Applied Soft Computing

Volume 10, Issue 1, January 2010, Pages 170-182

Genetic programming for QSAR investigation of docking energy

https://doi.org/10.1016/j.asoc.2009.06.013

Abstract

Statistical methods, and in particular Machine Learning, have been increasingly used in the drug development workflow to accelerate the discovery phase and to eliminate possible failures early in clinical development. In the past, the authors of this paper have worked specifically on two problems: (i) prediction of drug-induced toxicity and (ii) evaluation of the target–drug chemical interaction based on chemical descriptors. Among the numerous existing Machine Learning methods and their applications to drug development (see for instance [F. Yoshida, J.G. Topliss, QSAR model for drug human oral bioavailability, Journal of Medicinal Chemistry 43 (2000) 2575–2585; J. Frohlich, J. Wegner, F. Sieker, A. Zell, Kernel functions for attributed molecular graphs—a new similarity based approach to ADME prediction in classification and regression, QSAR and Combinatorial Science, 38(4) (2003) 427–431; C.W. Andrews, L. Bennett, L.X. Yu, Predicting human oral bioavailability of a compound: development of a novel quantitative structure–bioavailability relationship, Pharmacological Research 17 (2000) 639–644; J. Feng, L. Lurati, H. Ouyang, T. Robinson, Y. Wang, S. Yuan, S.S. Young, Predictive toxicology: benchmarking molecular descriptors and statistical methods, Journal of Chemical Information and Computer Sciences 43 (2003) 1463–1470; T.M. Martin, D.M. Young, Prediction of the acute toxicity (96-h LC50) of organic compounds to the fathead minnow (Pimephales promelas) using a group contribution method, Chemical Research in Toxicology 14(10) (2001) 1378–1385; G. Colmenarejo, A. Alvarez-Pedraglio, J.L. Lavandera, Chemoinformatic models to predict binding affinities to human serum albumin, Journal of Medicinal Chemistry 44 (2001) 4370–4378; J. Zupan, P. Gasteiger, Neural Networks in Chemistry and Drug Design: An Introduction, 2nd edition, Wiley, 1999]), we have been specifically concerned with Genetic Programming. A first paper [F. Archetti, E. Messina, S. Lanzeni, L. Vanneschi, Genetic programming for computational pharmacokinetics in drug discovery and development, Genetic Programming and Evolvable Machines 8(4) (2007) 17–26] was devoted to problem (i). The present contribution aims at developing a Genetic Programming based framework on which to build specific strategies, which are then shown to be a valuable tool for problem (ii). In this paper, we use target estrogen receptor molecules and genistein based drug compounds. Being able to predict their mutual interaction energy precisely and efficiently is a very important task: for example, it may have an immediate relationship with the efficacy of genistein based drugs in menopause therapy and also as a natural prevention of some tumors. We compare the experimental results obtained by Genetic Programming with those of a set of “non-evolutionary” Machine Learning methods, including Support Vector Machines, Artificial Neural Networks, Linear and Least Square Regression. Experimental results confirm that Genetic Programming is a promising technique in terms of the accuracy of the proposed solutions, their generalization ability, and the correlation between predicted and correct values.

Introduction

The goal of this paper is to investigate the usefulness of Genetic Programming (GP) [9], [10] for automatically generating the underlying functional relationship between a set of molecular descriptors of drug-like compounds and their value of the interaction, or docking, energy with a particular estrogen receptor. Being able to develop automatic computer systems to successfully and efficiently predict the mutual interaction energy between drug-like compounds and estrogen receptors would have a great impact, given that this interaction energy has an immediate relationship with the efficacy of those drugs.

GP is an evolutionary approach which extends the genetic model of learning to the space of programs. It is a major variation of Genetic Algorithms [11], [12] in which the evolving individuals are themselves computer programs instead of fixed-length strings over a limited alphabet of symbols. In the last few years, GP has become more and more popular for biomedical and pharmacokinetic applications. In particular, GP has recently been used to mine large datasets with the goal of automatically generating the underlying (hidden) functional relationship between data and of correlating the behavior of latent features with some interesting pharmacokinetic parameters bound to drug activity patterns. For instance, in [13] GP has been used to classify drug-like molecules in terms of their bioavailability, in [14] it has been used with mutual information methods for analyzing complex molecular data, in [8] it has been used for quantitative prediction of drug-induced toxicity, and in [15] it has been applied to cancer expression profiling data to select features and build molecular classifiers by mathematical integration of genes.
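To make the idea of "individuals that are programs" concrete, the following minimal Python sketch represents one candidate program as an expression tree and evaluates it on a vector of molecular descriptors. The nested-tuple encoding, the primitive set, the protected division and the descriptor values are all illustrative assumptions; they are not the configuration used in this paper.

```python
import operator


def protected_div(a, b):
    """Division that returns 1.0 when the denominator is (near) zero.

    This is a common convention in tree-based GP; its use here is an assumption.
    """
    return a / b if abs(b) > 1e-9 else 1.0


# Hypothetical primitive set: binary arithmetic operators only.
FUNCTIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": protected_div,
}


def evaluate(tree, descriptors):
    """Evaluate a program tree on one molecule's descriptor vector.

    A tree is a float constant, a string "x<i>" naming descriptor i,
    or a tuple (op, left, right).
    """
    if isinstance(tree, tuple):
        op, left, right = tree
        return FUNCTIONS[op](evaluate(left, descriptors),
                             evaluate(right, descriptors))
    if isinstance(tree, str):
        return descriptors[int(tree[1:])]
    return float(tree)


# Example individual encoding (x0 * x2) + (x1 / 2.0); values are made up.
individual = ("+", ("*", "x0", "x2"), ("/", "x1", 2.0))
prediction = evaluate(individual, [1.5, 4.0, -2.0])
```

In a full GP system, a population of such trees would be varied by crossover and mutation and selected according to how well their predictions match the known docking energies.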

GP can be regarded as an optimization method that makes no assumptions about the objective function or the data. Furthermore, as pointed out in [8] and explained in detail later in this paper, GP often performs feature selection automatically, maintaining in the population expressions that use only subsets of the data. Thus, the motivation behind our choice of investigating the usefulness of GP for assessing large biomedical datasets is twofold:

  • Biological/chemical data are not independent of each other. Rather, it has been verified that in most complex biochemical systems, small subsets of components work together [16]. These phenomena lead to strong mutual dependencies among the features. Hence, the underlying algorithm should make no assumptions about the inter-dependencies between the different variables. Furthermore, the algorithm should be capable of extracting the underlying features governing the biochemical reactions from high-dimensional correlated data.

  • The dimensionality of the feature space in biomedical datasets is normally much higher than the number of observations available for training. Hence, automatic feature selection, as well as other methods to handle overfitting and minimize the generalization error, should be encouraged.
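The implicit feature selection mentioned above can be illustrated with a short Python sketch: the descriptors "selected" by an evolved expression are simply those that appear as leaves of its tree. The nested-tuple encoding and the "x<i>" naming of descriptors are assumptions made for this illustration only.

```python
def used_features(tree):
    """Return the set of descriptor indices that appear in a program tree.

    A tree is a numeric constant, a string "x<i>" naming descriptor i,
    or a tuple (op, child, child, ...). This encoding is hypothetical.
    """
    if isinstance(tree, tuple):
        _, *children = tree
        found = set()
        for child in children:
            found |= used_features(child)
        return found
    if isinstance(tree, str) and tree.startswith("x"):
        return {int(tree[1:])}
    return set()  # numeric constants use no descriptor


# Made-up individual: of all available descriptors it uses only x3 and x41.
individual = ("+", ("*", "x3", "x41"), ("-", "x3", 0.5))
selected = sorted(used_features(individual))
```

Applied to the best individuals of several runs, such a count would show which descriptors the evolved models rely on and which are never used.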

Pharmacokinetics prediction tools are usually based on one of two approaches: molecular modelling, which uses intensive protein structure calculations, and data modelling. Methods based on data modelling are widely reported in the literature; they all belong to the category of Quantitative Structure Activity Relationship (QSAR) models [17], and they are adopted in the present work. To quantify the real usefulness of GP for the presented application, experimental results are compared with those of a set of well-known Machine Learning (ML) methods, including Support Vector Machines (SVM), Artificial Neural Networks, Linear and Least Square Regression. These will be referred to as “non-evolutionary” methods for simplicity.

This paper is structured as follows: Section 2 discusses previous and related work; Section 3 describes the method employed to build the dataset used in our experiments; Section 4 briefly describes the non-evolutionary ML methods used in this paper and discusses their experimental results on our dataset; Section 5 introduces the different versions of GP that we have tested in this work and discusses their experimental results; Section 6 describes a method to improve GP results for the studied problem; Section 7 reports further details about the experiments; finally, Section 8 concludes the paper and offers hints for future research.

Section snippets

Previous and related work

As outlined above, the goal of this paper is to investigate the usefulness of GP for generating the hidden relationship between molecular descriptors and docking energy. Virtual molecular docking represents a basic step in rational drug design. Its objective is to predict how a macromolecule (typically a protein or nucleic acid) interacts with other molecules called “ligands” (which may be other proteins, peptides or small drug-like molecules) by calculating their interaction energy in some

Dataset

We have collected from the RCSB PDB database [33] a small set of estrogen–genistein virtual molecules. Subsequently, we have defined substitution points to which we have attached a small database of substituents (OH, CH3, CH2CH3, CH2OH, CH2CH2OH, CH2CH2NH2, OCH2CH2NH2), obtaining a set of 992 genistein based virtual molecules. The resulting chemical structures were then optimized by means of molecular mechanics using the MOE software [34] and the MMFF94 force field [35] for calculating 267 molecular

Non-evolutionary methods

To assess our dataset, we have used a set of regression methods. For simplicity, we partition them into two broad classes that we call non-evolutionary methods and GP methods. GP methods consist of variants of the standard version of tree-based GP and will be described in the next sections. In this section, we present the non-evolutionary methods we have used and discuss their experimental results.

GP methods

In this section we describe the GP versions that we have used and discuss their experimental results. Configurations and parameters have been tuned by a set of experiments, in which many possible alternatives have been tested. The different configurations that have been tested are discussed in Appendix A.

We have used a tree-based GP configuration for regression problems inspired by [9], [43], [44]. Each molecular feature has been represented as a floating point number. Potential solutions (GP

Search for recurrent patterns

In all the experiments presented so far, we have empirically observed that “good” individuals often share some common structures (i.e. subtrees or “parts” of subtrees). For this reason, for each one of the 100 independent LinScalGP 2 runs discussed in Section 5.3, we have considered the individual with the best RMSE on the test set and the one with the best CC

Further experiments

In the present section, we give further details about the experiments whose results have been described previously in this manuscript. In particular, Section 7.1 contains an analysis of the CPU completion times of the various Machine Learning methods, in order to further establish the utility of the proposed method, and Section 7.2 reports the average RMSE and CC over the generations for the different GP variants studied.
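The two quality measures tracked in these experiments can be sketched as follows. This is a minimal illustration, assuming that RMSE denotes the root mean squared error and CC the Pearson correlation coefficient between predicted and target docking energies; the function names and the toy values are hypothetical and are not taken from the paper.

```python
import math


def rmse(predicted, target):
    """Root mean squared error between two equally long sequences."""
    n = len(target)
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predicted, target)) / n)


def pearson_cc(predicted, target):
    """Pearson correlation coefficient (assumed meaning of CC)."""
    n = len(target)
    mean_p = sum(predicted) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, target))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0.0 or var_t == 0.0:
        return 0.0  # correlation is undefined for a constant sequence
    return cov / math.sqrt(var_p * var_t)


# Toy values: every prediction is offset from its target by 0.5.
target = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
error = rmse(predicted, target)
cc = pearson_cc(predicted, target)
```

The toy data show why both measures are reported together: a constant offset leaves the correlation perfect while the RMSE is non-zero, so CC alone would hide a systematic bias.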

Conclusions and future work

Machine Learning methods, including various versions of Genetic Programming, have been employed for assessing and predicting the value of the docking energy of genistein based drug compounds with estrogen receptor proteins. This application is important since the ability to predict this value correctly could help us select the most promising genistein based drugs for menopause therapy and also as a natural prevention of some tumors. Genetic Programming using linear scaling for optimizing

Acknowledgments

We acknowledge DELOS Srl [26] for allowing us to use their software environment.

References (49)

  • M. Pintore et al.

    Prediction of oral bioavailability by adaptive fuzzy partitioning

    European Journal of Medicinal Chemistry

    (2003)
  • N. Greene

    Computer systems for the prediction of toxicity: an update

    Advanced Drug Delivery Reviews

    (2002)
  • F. Yoshida et al.

    QSAR model for drug human oral bioavailability

    Journal of Medicinal Chemistry

    (2000)
  • J. Frohlich et al.

    Kernel functions for attributed molecular graphs—a new similarity based approach to ADME prediction in classification and regression

    QSAR and Combinatorial Science

    (2003)
  • C.W. Andrews et al.

    Predicting human oral bioavailability of a compound: development of a novel quantitative structure-bioavailability relationship

    Pharmacological Research

    (2000)
  • J. Feng et al.

    Predictive toxicology: benchmarking molecular descriptors and statistical methods

    Journal of Chemical Information and Computer Sciences

    (2003)
  • T.M. Martin et al.

    Prediction of the acute toxicity (96-h LC50) of organic compounds to the fathead minnow (Pimephales promelas) using a group contribution method

    Chemical Research in Toxicology

    (2001)
  • G. Colmenarejo et al.

    Chemoinformatic models to predict binding affinities to human serum albumin

    Journal of Medicinal Chemistry

    (2001)
  • J. Zupan et al.

    Neural Networks in Chemistry and Drug Design: An Introduction

    (1999)
  • F. Archetti et al.

    Genetic programming for computational pharmacokinetics in drug discovery and development

    Genetic Programming and Evolvable Machines

    (2007)
  • J.R. Koza

    Genetic Programming

    (1992)
  • L. Vanneschi, Theory and practice for efficient genetic programming. PhD thesis, Faculty of Sciences, University of...
  • J.H. Holland

    Adaptation in Natural and Artificial Systems

    (1975)
  • D.E. Goldberg

    Genetic Algorithms in Search, Optimization and Machine Learning

    (1989)
  • W.B. Langdon et al.

    Genetic programming in data mining for drug discovery

    Evolutionary Computing in Data Mining

    (2004)
  • V. Venkatraman et al.

    Evaluation of mutual information and genetic programming for feature selection in QSAR

    Journal of Chemical Information and Computer Sciences

    (2004)
  • J. Yu et al.

    Feature selection and molecular classification of cancer using genetic programming

    Neoplasia

    (2007)
  • N. Dasgupta et al.

    Modeling pharmacogenomics of the NCI-60 anticancer data set: utilizing kernel PLS to correlate the microarray data to therapeutic responses

    Methods of Microarray Data Analysis II

    (2002)
  • H. Van de Waterbeemd et al.

    The Practice of Medicinal Chemistry

    (2003)
  • D.B. Kitchen et al.

    Docking and scoring in virtual screening for drug discovery: methods and applications

    Nature Reviews Drug Discovery

    (2004)
  • E.M. Krovat et al.

    Recent advances in docking and scoring

    Current Computer-Aided Drug Design

    (2005)
  • J.M. Blaney et al.

    A good ligand is hard to find: automated docking methods of special interest

    Perspectives in Drug Discovery and Design

    (1993)
  • J.S. Dixon

    Flexible docking of ligands to receptor sites using genetic algorithms

  • C.M. Oshiro et al.

    Flexible ligand docking using a genetic algorithm

    Journal of Computer-Aided Molecular Design

    (1995)