DOI: 10.1145/3624062.3624081
research-article

Scalable Lead Prediction with Transformers using HPC resources

Published: 12 November 2023

ABSTRACT

A promising direction in cancer drug discovery is high-throughput screening of extensive compound datasets to identify compounds with advantageous properties, including the ability to interact with relevant biomolecules such as proteins. However, traditional structural approaches for assessing binding affinity, such as free energy methods or molecular docking, become significant computational bottlenecks on datasets of this size. To address this, we have developed a docking surrogate called the SMILES transformer (ST), which learns molecular features from the SMILES representation of compounds and approximates their binding affinity. SMILES data is first tokenized using a well-established SMILES-pair tokenizer and fed into a BERT-like Transformer model, which generates a vector embedding for each molecule that captures its essential information. These embeddings are then fed into a regression model to predict the binding affinity. Leveraging the high-performance computing resources at Argonne National Laboratory, we devised a workflow to scale model training and inference across multiple supercomputing nodes. To evaluate the performance and accuracy of our workflow, we conducted experiments using molecular docking binding affinity data on multiple receptors, comparing ST with another state-of-the-art docking surrogate. Both surrogates yielded comparable validation R² (val-r2) values of between 70% and 90%, affirming the capability of ST to learn molecular features directly from language-based data. A further advantage of ST is that its tokenization preprocessing is notably faster than that of the alternative method, which requires generating molecular descriptors using Mordred. Our workflow enabled screening of ∼3 billion compounds on 48 nodes of the Polaris supercomputer in approximately an hour.
In summary, our approach presents an efficient means to screen extensive compound databases for molecules whose properties could make them lead compounds targeting cancer. Looking ahead, an important future direction for our workflow is integrating de novo drug design, enabling us to scale our efforts to explore the limits of synthesizable compounds within chemical space.
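
The first stage described in the abstract, turning a SMILES string into token IDs for a Transformer, can be illustrated with a minimal sketch. This is not the tokenizer used in the paper: the atom-level regular expression, the merge list `MERGES`, the special tokens, and every function name below are assumptions made for illustration, and the merges are hand-written rather than learned from a corpus. The sketch splits a SMILES string into atom-level tokens, applies pair merges in priority order (in the spirit of byte-pair-style SMILES tokenizers), and pads the resulting IDs to a fixed length. The embedding and regression stages are out of scope here.

```python
import re

# Atom-level SMILES pattern: bracket atoms, two-letter halogens, organic-subset
# atoms, ring-bond digits, bonds and branches. Assumed for illustration only.
ATOM_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|%\d{2}|\d|[=#\-\+\\/\(\)\.:~@\?\*\$])"
)

SPECIALS = ["[PAD]", "[UNK]", "[CLS]"]  # hypothetical special tokens

# Hypothetical merge list, hand-written; a real tokenizer learns these pairs.
MERGES = [("c", "c"), ("C", "C"), ("(", "=")]


def atom_tokenize(smiles):
    """Split a SMILES string into atom-level tokens."""
    tokens = ATOM_PATTERN.findall(smiles)
    # Refuse input containing characters the pattern does not cover,
    # instead of silently dropping them.
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in SMILES: {smiles!r}")
    return tokens


def apply_merges(tokens, merges):
    """Merge adjacent token pairs, one merge rule at a time, left to right."""
    for left, right in merges:
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


def build_vocab(tokens):
    """Assign integer IDs: special tokens first, then tokens in order seen."""
    vocab = {tok: i for i, tok in enumerate(SPECIALS)}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab


def encode(tokens, vocab, max_len):
    """Prepend [CLS], map tokens to IDs, then truncate or pad to max_len."""
    ids = [vocab["[CLS]"]] + [vocab.get(t, vocab["[UNK]"]) for t in tokens]
    ids = ids[:max_len]
    return ids + [vocab["[PAD]"]] * (max_len - len(ids))


if __name__ == "__main__":
    aspirin = "CC(=O)Oc1ccccc1C(=O)O"
    tokens = apply_merges(atom_tokenize(aspirin), MERGES)
    vocab = build_vocab(tokens)
    print(tokens)
    print(encode(tokens, vocab, max_len=20))
```

Fixed-length ID sequences of this kind are the usual input to a BERT-like encoder. Only string operations are involved, which is consistent with the abstract's point that tokenization is cheap compared with computing molecular descriptors.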

          • Published in

            SC-W '23: Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis
            November 2023
            2180 pages
            ISBN:9798400707858
            DOI:10.1145/3624062

            Copyright © 2023 ACM

            Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

            Publisher

            Association for Computing Machinery

            New York, NY, United States


            Qualifiers

            • research-article
            • Research
            • Refereed limited