ABSTRACT
A promising direction in cancer drug discovery is high-throughput screening of extensive compound datasets to identify compounds with advantageous properties, including the ability to interact with relevant biomolecules such as proteins. However, traditional structural approaches for assessing binding affinity, such as free energy methods or molecular docking, become significant computational bottlenecks on datasets of this size. To address this, we have developed a docking surrogate called the SMILES transformer (ST), which learns molecular features from the SMILES representation of compounds and approximates their binding affinity. SMILES data is first tokenized using a well-established SMILES-pair tokenizer and fed into a BERT-like Transformer model, which generates a vector embedding for each molecule that captures its essential features. These embeddings are then fed into a regression model to predict the binding affinity. Leveraging the high-performance computing resources at Argonne National Laboratory, we devised a workflow to scale model training and inference across multiple supercomputing nodes. To evaluate the performance and accuracy of our workflow, we conducted experiments using molecular docking binding affinity data on multiple receptors, comparing ST with another state-of-the-art docking surrogate. Both surrogates yielded comparable validation R² (val-r2) values of between 70% and 90%, affirming the capability of ST to learn molecular features directly from language-based data. Furthermore, one significant advantage of the ST approach is that its tokenization preprocessing is notably faster than that of the alternative method, which requires generating molecular descriptors using Mordred. Our workflow facilitated screening of ∼3 billion compounds on 48 nodes of the Polaris supercomputer in approximately an hour.
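As a rough illustration of three quantities mentioned in the abstract (SMILES tokenization, the R² validation metric, and screening throughput), here is a minimal Python sketch. It is not the authors' implementation: the tokenizer is a simplified atom-level regex tokenizer standing in for the SMILES-pair tokenizer, the Transformer and regression models are omitted entirely, and all function names are hypothetical.

```python
import re

# Assumption: a simplified atom-level SMILES tokenizer. The paper uses a
# SMILES-pair tokenizer, which additionally merges frequent token pairs.
_SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
)


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into atom-level tokens (hypothetical helper)."""
    tokens = _SMILES_TOKEN.findall(smiles)
    # Reject strings containing characters the pattern does not cover.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


def r2_score(y_true: list[float], y_pred: list[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def compounds_per_node_per_second(n_compounds: float, n_nodes: int, seconds: float) -> float:
    """Average screening rate per node, assuming an even split across nodes."""
    return n_compounds / n_nodes / seconds


if __name__ == "__main__":
    print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))  # aspirin
    print(r2_score([1.0, 2.0, 3.0], [1.1, 1.9, 3.2]))
    # Figures from the abstract: ~3 billion compounds, 48 nodes, ~1 hour.
    print(compounds_per_node_per_second(3e9, 48, 3600))
```

Taking the abstract's approximate figures at face value, the last call works out to roughly 17,000 compounds per node per second. This is an average that assumes the work is split evenly across nodes.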
In summary, our approach presents an efficient means to screen extensive compound databases for molecules whose properties make them potential lead compounds targeting cancer. Looking ahead, an important future direction for our workflow is integrating de novo drug design, enabling us to scale our efforts to explore the limits of synthesizable compounds within chemical space.