Machine learning (ML) has disruptively changed the way scientists predict molecular structure and properties that are relevant to chemical and materials design. Graph neural networks (GNNs) are an example of ML models that have shown great promise in such tasks. However, the success of GNNs in molecular prediction relies on a supervised training strategy, which requires a large amount of labeled data: annotating molecules with labels is time-consuming, and, more importantly, it can become impractical given the vast chemical space. Task-agnostic transformer-based language models are a promising alternative for learning from unlabeled corpora, but the string-based representations that they often use, such as SMILES (simplified molecular-input line-entry system), do not contain the precise topological information that GNNs exploit, which limits the prediction accuracy of language models. In a recent study, Jerret Ross, Payel Das and colleagues introduce a large-scale transformer-based language model with relative position embedding that enables the encoding of spatial information in molecules.
The molecular language transformer (MOLFORMER) comprises two steps: pre-training and downstream molecular property prediction. MOLFORMER was first pre-trained on unlabeled SMILES sequences of 1.1 billion molecules drawn from two public chemical datasets, PubChem and ZINC. The authors developed an approximation scheme that makes linear attention compatible with the recently proposed rotary position embedding (introduced with the RoFormer model), which improved the model's scalability with respect to string length. The rotary embedding also makes the model aware of the relative positional information of atoms, resulting in faster convergence compared with conventional absolute position embeddings. In the second step, the pre-trained language model was fine-tuned with task-specific data for different downstream tasks, such as prediction of various molecular properties and recovery of molecular similarity. The best variant of MOLFORMER outperformed state-of-the-art GNN models in various prediction tasks, including the prediction of quantum-chemical and biophysical properties. Overall, the results demonstrate that incorporating more structural information into large chemical language models enables accurate property prediction for molecules.
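To illustrate the rotary position embedding idea only (this is not the authors' implementation, and it omits the linear-attention approximation that MOLFORMER combines it with), a minimal sketch is shown below: each pair of query/key features is rotated by an angle proportional to the token's position, so the dot product between a rotated query and key depends only on their relative distance. The function name and toy dimensions are hypothetical.

```python
import numpy as np

def rotary_embed(x, base=10000.0):
    """Apply rotary position embeddings (RoPE) to a sequence of vectors.

    x: array of shape (seq_len, dim), with dim even. Each feature pair
    (2i, 2i+1) is rotated by an angle that grows linearly with position,
    so rotated query-key dot products encode relative positions.
    Illustrative sketch only, not MOLFORMER's actual code.
    """
    seq_len, dim = x.shape
    half = dim // 2
    # Per-pair rotation frequencies, following the RoFormer formulation.
    freqs = base ** (-2.0 * np.arange(half) / dim)        # (half,)
    angles = np.outer(np.arange(seq_len), freqs)          # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                       # even/odd features
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Toy usage: rotate queries and keys for a short SMILES token sequence.
rng = np.random.default_rng(0)
q = rotary_embed(rng.normal(size=(5, 8)))
k = rotary_embed(rng.normal(size=(5, 8)))
scores = q @ k.T  # attention logits now reflect relative token positions
```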