A scalable and accurate method for classifying protein–ligand binding geometries using a MapReduce approach

https://doi.org/10.1016/j.compbiomed.2012.05.001

Abstract

We present a scalable and accurate method for classifying protein–ligand binding geometries in molecular docking. Our method is a three-step process: the first step encodes the geometry of a three-dimensional (3D) ligand conformation into a single 3D point in the space; the second step builds an octree by assigning an octant identifier to every single point in the space under consideration; and the third step performs an octree-based clustering on the reduced conformation space and identifies the most dense octant. We adapt our method for MapReduce and implement it in Hadoop. The load-balancing, fault-tolerance, and scalability in MapReduce allow screening of very large conformation spaces not approachable with traditional clustering methods. We analyze results for docking trials for 23 protein–ligand complexes for HIV protease, 21 protein–ligand complexes for Trypsin, and 12 protein–ligand complexes for P38alpha kinase. We also analyze cross docking trials for 24 ligands, each docking into 24 protein conformations of the HIV protease, and receptor ensemble docking trials for 24 ligands, each docking in a pool of HIV protease receptors. Our method demonstrates significant improvement over energy-only scoring for the accurate identification of native ligand geometries in all these docking assessments. The advantages of our clustering approach make it attractive for complex applications in real-world drug design efforts. We demonstrate that our method is particularly useful for clustering docking results using a minimal ensemble of representative protein conformational states (receptor ensemble docking), which is now a common strategy to address protein flexibility in molecular docking.

Section snippets

Introduction and motivation

Cutting-edge distributed technologies, such as cloud and volunteer computing, provide scientists with an efficient and scalable way to perform computationally expensive simulations at a rate never seen before. However, this new capability to perform longer simulations presents new challenges for scientists who have to deal with the analysis, sorting, and selection of scientifically meaningful results from the massive amounts of data collected. Clustering techniques are an effective approach

Protein–ligand docking, cross docking, and receptor ensemble docking

The computational search for potential drug-like lead molecules in virtual screening relies on molecular protein–ligand docking to simulate the docking of small molecules (also called ligands) into proteins involved in the disease process. Protein–ligand docking is a search with uncertainties in a very large space of potential docking conformations; this space is shaped by the protein, the ligand, the computational methods, and the degrees of freedom to be explored [17]. Given a protein,

Methodology

We propose a novel method to identify 3D ligand conformations docked into one or multiple protein conformations, together with a scalable implementation of the method using MapReduce. The load balancing, fault tolerance, and scalability of MapReduce make the method attractive for exhaustively screening the large resulting space of ligand conformations, which is difficult with traditional clustering methods.
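To make the octree step concrete, the following is a minimal single-machine sketch in Python and not the Hadoop implementation described in the paper. It assumes each ligand conformation has already been reduced to one 3D point inside a cube [lo, hi)^3, and it counts points per octant at one fixed depth rather than refining the tree adaptively. The function names (`octant_id`, `map_phase`, `reduce_phase`, `densest_octant`), the bounds, and the depth are all hypothetical choices made for illustration.

```python
def octant_id(point, lo, hi, depth):
    """Return the octant path (one digit 0-7 per level) of a 3D point.

    Assumption: the point lies in the cube [lo, hi)^3. Bit k of each
    digit is set when the point falls in the upper half along axis k.
    """
    mins = [lo, lo, lo]
    maxs = [hi, hi, hi]
    digits = []
    for _ in range(depth):
        digit = 0
        for axis in range(3):
            mid = (mins[axis] + maxs[axis]) / 2.0
            if point[axis] >= mid:
                digit |= 1 << axis
                mins[axis] = mid   # descend into the upper half
            else:
                maxs[axis] = mid   # descend into the lower half
        digits.append(str(digit))
    return "".join(digits)


def map_phase(points, lo, hi, depth):
    """Map: emit one (octant identifier, 1) pair per encoded conformation."""
    for p in points:
        yield octant_id(p, lo, hi, depth), 1


def reduce_phase(pairs):
    """Reduce: sum the emitted counts for each octant identifier."""
    counts = {}
    for key, value in pairs:
        counts[key] = counts.get(key, 0) + value
    return counts


def densest_octant(points, lo=-1.0, hi=1.0, depth=2):
    """Return (octant identifier, count) for the most populated octant."""
    counts = reduce_phase(map_phase(points, lo, hi, depth))
    return max(counts.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    # Hypothetical encoded conformations: three close together, one outlier.
    encoded = [(0.6, 0.6, 0.6), (0.7, 0.7, 0.7),
               (0.65, 0.6, 0.7), (-0.6, -0.6, -0.6)]
    print(densest_octant(encoded))
```

In a MapReduce deployment the map and reduce functions would run on separate workers, with the framework grouping pairs by octant identifier between the two phases; here the grouping is folded into `reduce_phase` for brevity.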

Test set-up

We collect the dataset for testing our proposed method for classifying binding geometries by using the D@H project. On D@H, we ran docking trials for 23 protein–ligand complexes for HIV protease (an aspartic acid protease protein), 21 protein–ligand complexes for Trypsin (a serine protease protein), and 12 protein–ligand complexes for P38alpha kinase (a serine/threonine kinase protein). We also ran cross docking trials for 24 ligands, each docking into 24 protein conformations of the HIV

Related work

Exploring the search space of docking conformations has been approached using a variety of techniques including data analytics and clustering. Analytic approaches usually select one or multiple conformations that are likely to be near-native at runtime and then perform an extensive sampling around the predicted conformations. Important work in this direction includes Yang et al. [32] and Liang et al. [21]. These approaches improve the accuracy of docking methods and increase the probability of

Conclusions

In protein–ligand docking, accurately ranking a series of ligand conformations (scoring) is important to successfully predict whether a given ligand will bind to one protein more favorably than others. It is acknowledged that energy-based scoring methods are error-prone and that traditional clustering methods based on geometries are not scalable. Still, protein–ligand docking simulations are delivering increasingly larger datasets of ligand conformations, and accurate solutions that are also

Conflict of interest statement

Roger Armen's CoI:

PI's Collaborators and Co-Editors (Past 48 Months)

Collaborators: C.L. Brooks III (U Michigan), A. Mapp (U Michigan), M. Taufer (U Delaware), D.J. Doren (U Delaware), T.O. Chan (TJU), U. Rodeck (TJU), J.M. Pascal (TJU), J.Y. Cheung (Temple), A.M. Feldman (Temple), J.L. Benovic (TJU), C.P. Scott (TJU), R.A. Panettieri (U Penn), S.B. Liggett (U Maryland), R.B. Penn (U Maryland), B. Lu (TJU), A.P. Dicker (TJU), J.F. Zhang (TJU).

PI's Graduate Advisors

Valerie Daggett (University of

Acknowledgments

This work was supported by the NSF IIS #0968350 entitled Collaborative Research: SoCS - ExSciTecH: An Interactive, Easy-to-Use Volunteer Computing System to Explore Science, Technology, and Health and by the NSF OCI Cooperative Agreement #0910847 entitled Flash Gordon: A Data Intensive Computer. We used Trestles and Gordon-ION resources of Teragrid and XSEDE that are provided by SDSC.

The authors thank Joshua Bernstein (Penguin Computing Inc.) for his help in installing and setting Hadoop on our

References (36)

  • M. Totrov et al., Flexible ligand docking to multiple receptor conformations: a practical alternative, Curr. Opin. Struct. Biol. (2008)
  • M.K. Gilson et al., The statistical-thermodynamic basis for computation of binding affinities: a critical review, J. Biophys. (1997)
  • M. Rarey et al., A fast flexible docking method using an incremental construction algorithm, J. Mol. Biol. (1996)
  • R. Abagyan et al., A new method for protein modeling and design: applications to docking and structure prediction from the distorted native conformation, J. Comput. Chem. (1996)
  • D.P. Anderson, BOINC: a system for public-resource computing and storage, in: Proceedings of the Fifth IEEE/ACM...
  • G. Bouvier et al., Automatic clustering of docking poses in virtual screening process using self-organising map, Bioinf. Adv. Access (2009)
  • B.R. Brooks et al., CHARMM: a program for macromolecular energy minimization, and dynamics calculations, J. Comput. Chem. (1983)
  • B.D. Bursulaya et al., Comparative study of several algorithms for flexible ligand docking, J. Comp. Aided Mol. Des. (2003)
  • M.W. Chang et al., Empirical entropic contributions in computational docking: evaluation in APS reductase complexes, J. Comput. Chem. (2008)
  • R.L.F. Cordeiro, C. Traina, Jr., A.J.M. Traina, J. López, U. Kang, C. Faloutsos, Clustering very large...
  • J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, in: Proceedings of the Sixth conference...
  • A. Ene, S. Im, B. Moseley, Fast clustering using MapReduce, in: Proceedings of the 17th ACM SIGKDD International...
  • T. Estrada, R.S. Armen, M. Taufer, Automatic selection of near-native protein–ligand conformations using a...
  • T. Estrada, B. Zhang, P. Cicotti, R.S. Armen, M. Taufer, Reengineering high-throughput molecular datasets for scalable...
  • P. Ferrara et al., Assessing scoring functions for protein–ligand interactions, J. Med. Chem. (2004)
  • P.C.D. Hawkins et al., How to do an evaluation: pitfalls and traps, J. Comp. Aided Mol. Des. (2008)
  • V. Hnizdo et al., Efficient calculation of configurational entropy from molecular simulations by combining the mutual-information expansion and nearest-neighbor methods, J. Comput. Chem. (2008)
  • A. Jain, Bias, reporting, and sharing: computational evaluations of docking methods, J. Comp. Aided Mol. Des. (2008)
1 T. Estrada and B. Zhang have contributed equally to this work.
