A scalable and accurate method for classifying protein–ligand binding geometries using a MapReduce approach

doi:10.1016/j.compbiomed.2012.05.001

Computers in Biology and Medicine

Volume 42, Issue 7, July 2012, Pages 758-771

https://doi.org/10.1016/j.compbiomed.2012.05.001

Abstract

We present a scalable and accurate method for classifying protein–ligand binding geometries in molecular docking. Our method is a three-step process: the first step encodes the geometry of a three-dimensional (3D) ligand conformation into a single 3D point in the space; the second step builds an octree by assigning an octant identifier to every single point in the space under consideration; and the third step performs an octree-based clustering on the reduced conformation space and identifies the most dense octant. We adapt our method for MapReduce and implement it in Hadoop. The load-balancing, fault-tolerance, and scalability in MapReduce allow screening of very large conformation spaces not approachable with traditional clustering methods. We analyze results for docking trials for 23 protein–ligand complexes for HIV protease, 21 protein–ligand complexes for Trypsin, and 12 protein–ligand complexes for P38alpha kinase. We also analyze cross docking trials for 24 ligands, each docking into 24 protein conformations of the HIV protease, and receptor ensemble docking trials for 24 ligands, each docking in a pool of HIV protease receptors. Our method demonstrates significant improvement over energy-only scoring for the accurate identification of native ligand geometries in all these docking assessments. The advantages of our clustering approach make it attractive for complex applications in real-world drug design efforts. We demonstrate that our method is particularly useful for clustering docking results using a minimal ensemble of representative protein conformational states (receptor ensemble docking), which is now a common strategy to address protein flexibility in molecular docking.
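The second and third steps described above can be illustrated with a minimal Python sketch. It assumes the first step has already reduced each ligand conformation to one 3D point; that encoding is not shown here. The function names (`octant_id`, `densest_octant`), the bounding box arguments `lo` and `hi`, and the choice of a fixed `depth` are assumptions made for illustration and are not taken from the paper.

```python
from collections import Counter
from typing import Iterable, Tuple

Point = Tuple[float, float, float]


def octant_id(point: Point, lo: Point, hi: Point, depth: int) -> str:
    """Return a string of `depth` digits (0-7), one digit per octree level.

    At each level the current box is halved along x, y and z; bit 0, 1 and 2
    of the digit record whether the point lies in the upper half of each axis.
    """
    box_lo, box_hi = list(lo), list(hi)
    digits = []
    for _ in range(depth):
        digit = 0
        for axis in range(3):
            mid = (box_lo[axis] + box_hi[axis]) / 2.0
            if point[axis] >= mid:
                digit |= 1 << axis
                box_lo[axis] = mid  # descend into the upper half
            else:
                box_hi[axis] = mid  # descend into the lower half
        digits.append(str(digit))
    return "".join(digits)


def densest_octant(points: Iterable[Point], lo: Point, hi: Point,
                   depth: int) -> Tuple[str, int]:
    """Count points per octant at the given depth and return the fullest one."""
    counts = Counter(octant_id(p, lo, hi, depth) for p in points)
    return counts.most_common(1)[0]


if __name__ == "__main__":
    # Hypothetical encoded conformations inside the unit cube.
    pts = [(0.10, 0.10, 0.10), (0.12, 0.11, 0.10), (0.90, 0.90, 0.90)]
    print(densest_octant(pts, (0.0, 0.0, 0.0), (1.0, 1.0, 1.0), depth=2))
```

Because an identifier at depth d is a prefix of the identifier at depth d+1, the same keys can be reused to count points at coarser levels without re-reading the coordinates.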

Section snippets

Introduction and motivation

Cutting-edge distributed technologies, such as cloud and volunteer computing, provide scientists with an efficient and scalable way to perform computationally expensive simulations at a rate never seen before. However, this new capability to perform longer simulations presents new challenges for scientists who have to deal with the analysis, sorting, and selection of scientifically meaningful results from the massive amounts of data collected. Clustering techniques are an effective approach

Protein–ligand docking, cross docking, and receptor ensemble docking

The computational search for potential drug-like lead molecules in virtual screening relies on molecular protein–ligand docking to simulate the docking of small molecules (also called ligands) into proteins involved in the disease process. Protein–ligand docking is a search with uncertainties in a very large space of potential docking conformations; this space is shaped by the protein, the ligand, the computational methods, and the degrees of freedom to be explored [17]. Given a protein,

Methodology

We propose a novel method to identify 3D ligand conformations docked into one or multiple protein conformations, together with its scalable implementation using MapReduce. The load balancing, fault tolerance, and scalability of MapReduce make the method attractive for exhaustively screening the large resulting space of ligand conformations, which is difficult with traditional clustering methods.
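To make the MapReduce formulation concrete, the following sketch simulates the map, shuffle, and reduce phases in plain Python for counting conformations per octant. It is not Hadoop code and does not use any Hadoop API. The record layout `(conformation_id, octant_key)`, where `octant_key` is a digit string with one digit per octree level, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import chain
from typing import Dict, Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (conformation_id, octant_key), assumed layout


def map_octant(record: Record, level: int) -> Iterator[Tuple[str, int]]:
    """Map: emit (octant key truncated to `level` digits, 1) for one record."""
    _conformation_id, octant_key = record
    yield octant_key[:level], 1


def reduce_count(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: sum the ones emitted for a single octant key."""
    return key, sum(values)


def run_job(records: Iterable[Record], level: int) -> Dict[str, int]:
    """Simulate one job: map every record, group by key, reduce each group."""
    shuffled: Dict[str, List[int]] = defaultdict(list)
    pairs = chain.from_iterable(map_octant(r, level) for r in records)
    for key, value in pairs:  # the shuffle phase groups values by key
        shuffled[key].append(value)
    return dict(reduce_count(k, v) for k, v in shuffled.items())


def densest_at_level(records: Iterable[Record], level: int) -> Tuple[str, int]:
    """Return the octant with the most conformations at the given level."""
    counts = run_job(records, level)
    return max(counts.items(), key=lambda kv: kv[1])


if __name__ == "__main__":
    recs = [("c1", "00"), ("c2", "01"), ("c3", "77")]
    print(run_job(recs, level=1))
    print(densest_at_level(recs, level=1))
```

Each mapper touches only its own record and each reducer only its own key, so the counting step parallelizes across nodes without moving the conformations themselves; only the small key/count pairs are shuffled.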

Test set-up

We collect the dataset for testing our proposed method for classifying binding geometries by using the D@H project. On D@H, we ran docking trials for 23 protein–ligand complexes for HIV protease (an aspartic acid protease protein), 21 protein–ligand complexes for Trypsin (a serine protease protein), and 12 protein–ligand complexes for P38alpha kinase (a serine/threonine kinase protein). We also ran cross docking trials for 24 ligands, each docking into 24 protein conformations of the HIV

Related work

Exploring the search space of docking conformations has been approached using a variety of techniques including data analytics and clustering. Analytic approaches usually select one or multiple conformations that are likely to be near-native at runtime and then perform an extensive sampling around the predicted conformations. Important work in this direction includes Yang et al. [32] and Liang et al. [21]. These approaches improve the accuracy of docking methods and increase the probability of

Conclusions

In protein–ligand docking, accurately ranking a series of ligand conformations (scoring) is important to successfully predict whether a given ligand will bind to one protein more favorably than others. It is acknowledged that energy-based scoring methods are error-prone and that traditional clustering methods based on geometries are not scalable. Still, protein–ligand docking simulations are delivering increasingly larger datasets of ligand conformations, and accurate solutions that are also

Conflict of interest statement

Roger Armen's CoI:

PI's Collaborators and Co-Editors (Past 48 Months)

Collaborators: C.L. Brooks III (U Michigan), A. Mapp (U Michigan), M. Taufer (U Delaware), D.J. Doren (U Delaware), T.O. Chan (TJU), U. Rodeck (TJU), J.M. Pascal (TJU), J.Y. Cheung (Temple), A.M. Feldman (Temple), J.L. Benovic (TJU), C.P. Scott (TJU), R.A. Panettieri (U Penn), S.B. Liggett (U Maryland), R.B. Penn (U Maryland), B. Lu (TJU), A.P. Dicker (TJU), J.F. Zhang (TJU).

PI's Graduate Advisors

Valerie Daggett (University of

Acknowledgments

This work was supported by the NSF IIS #0968350 entitled Collaborative Research: SoCS - ExSciTecH: An Interactive, Easy-to-Use Volunteer Computing System to Explore Science, Technology, and Health and by the NSF OCI Cooperative Agreement #0910847 entitled Flash Gordon: A Data Intensive Computer. We used Trestles and Gordon-ION resources of Teragrid and XSEDE that are provided by SDSC.

The authors thank Joshua Bernstein (Penguin Computing Inc.) for his help in installing and setting Hadoop on our

References (36)

M. Totrov et al.
Flexible ligand docking to multiple receptor conformations: a practical alternative
Curr. Opin. Struct. Biol.
(2008)
M.K. Gilson et al.
The statistical-thermodynamic basis for computation of binding affinities: a critical review
J. Biophys.
(1997)
M. Rarey et al.
A fast flexible docking method using an incremental construction algorithm
J. Mol. Biol.
(1996)
R. Abagyan et al.
A new method for protein modeling and design: applications to docking and structure prediction from the distorted native conformation
J. Comput. Chem.
(1996)
D.P. Anderson, BOINC: a system for public-resource computing and storage, in: Proceedings of the Fifth IEEE/ACM...
G. Bouvier et al.
Automatic clustering of docking poses in virtual screening process using self-organising map
Bioinf. Adv. Access
(2009)
B.R. Brooks et al.
CHARMM: a program for macromolecular energy minimization, and dynamics calculations
J. Comput. Chem.
(1983)
B.D. Bursulaya et al.
Comparative study of several algorithms for flexible ligand docking
J. Comp. Aided Mol. Des.
(2003)
M.W. Chang et al.
Empirical entropic contributions in computational docking: evaluation in APS reductase complexes
J. Comput. Chem.
(2008)
R.L.F. Cordeiro, C. Traina, Jr., A.J.M. Traina, J. López, U. Kang, C. Faloutsos, Clustering very large...
J. Dean, S. Ghemawat, MapReduce: simplified data processing on large clusters, in: Proceedings of the Sixth conference...
A. Ene, S. Im, B. Moseley, Fast clustering using MapReduce, in: Proceedings of the 17th ACM SIGKDD International...
T. Estrada, R.S. Armen, M. Taufer, Automatic selection of near-native protein–ligand conformations using a...
T. Estrada, B. Zhang, P. Cicotti, R.S. Armen, M. Taufer, Reengineering high-throughput molecular datasets for scalable...
P. Ferrara et al.
Assessing scoring functions for protein–ligand interactions
J. Med. Chem.
(2004)
P.C.D. Hawkins et al.
How to do an evaluation: pitfalls and traps
J. Comp. Aided Mol. Des.
(2008)
V. Hnizdo et al.
Efficient calculation of configurational entropy from molecular simulations by combining the mutual-information expansion and nearest-neighbor methods
J. Comput. Chem.
(2008)
A. Jain
Bias, reporting, and sharing: computational evaluations of docking methods
J. Comp. Aided Mol. Des.
(2008)


¹: T. Estrada and B. Zhang have contributed equally to this work.
