Machine learning (ML) has disruptively changed the way scientists predict molecular structure and properties that are relevant to chemical and materials design. Graph neural networks (GNNs) are an example of ML models that have shown great promise in such tasks. However, the success of GNNs in molecular prediction relies on a supervised training strategy, which requires a large amount of labeled data: annotating molecules with labels is time-consuming, and, more importantly, it can become impractical given the vast chemical space. Task-agnostic transformer-based language models are a promising alternative for learning from unlabeled corpora, but the string-based representations that they often use, such as SMILES (simplified molecular-input line-entry system), do not contain the precise topological information that GNNs exploit, which limits the prediction accuracy of language models. In a recent study, Jerret Ross, Payel Das and colleagues introduce a large-scale transformer-based language model with relative position embedding that enables the encoding of spatial information in molecules.
The molecular language transformer (MOLFORMER) comprises two steps: pre-training and downstream molecular property prediction. MOLFORMER was first pre-trained on unlabeled SMILES sequences of 1.1 billion molecules drawn from two public chemical datasets, PubChem and ZINC. The authors developed an approximation scheme that makes linear attention compatible with the recently proposed rotary position embedding (introduced with the RoFormer model), which improved the model's scalability with respect to string length. The rotary embedding also makes the model aware of the relative positional information of atoms, resulting in faster convergence compared with conventional absolute position embeddings. In the second step, the pre-trained language model was fine-tuned with task-specific data for different downstream tasks, such as prediction of various molecular properties and recovery of molecular similarity. The best variant of MOLFORMER outperformed state-of-the-art GNN models in various prediction tasks, including the prediction of quantum-chemical and biophysical properties. Overall, the results demonstrate that incorporating more structural information into large chemical language models enables accurate property prediction for molecules.
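To illustrate the rotary position embedding idea only (this is not the authors' implementation, and it omits the linear-attention approximation that MOLFORMER combines it with), a minimal sketch is shown below: each pair of query/key features is rotated by an angle proportional to the token's position, so the dot product between a rotated query and key depends only on their relative distance. The function name and toy dimensions are hypothetical.

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Apply rotary position embeddings (RoPE) to a sequence of vectors.

    x: array of shape (seq_len, dim), with dim even. Each feature pair
    (2i, 2i+1) is rotated by an angle that grows linearly with position,
    so rotated query-key dot products encode relative positions.
    Illustrative sketch only, not MOLFORMER's actual code.
    """
    seq_len, dim = x.shape
    half = dim // 2
    # Per-pair rotation frequencies, following the RoFormer formulation.
    freqs = base ** (-2.0 * np.arange(half) / dim)        # (half,)
    angles = np.outer(np.arange(seq_len), freqs)          # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                       # even/odd features
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Toy usage: rotate queries and keys for a short SMILES token sequence.
rng = np.random.default_rng(0)
q = rotary_embed(rng.normal(size=(5, 8)))
k = rotary_embed(rng.normal(size=(5, 8)))
scores = q @ k.T  # attention logits now reflect relative token positions
```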