Abstract
Protein function prediction is an essential task in bioinformatics which benefits disease mechanism elucidation and drug target discovery. Due to the explosive growth of proteins in sequence databases and the diversity of their functions, it remains challenging to predict protein functions quickly and accurately from sequences alone. Although many methods have integrated protein structures, biological networks or literature information to improve performance, these extra features are often unavailable for most proteins. Here, we propose SPROF-GO, a Sequence-based alignment-free PROtein Function predictor which leverages a pretrained language model to efficiently extract informative sequence embeddings and employs self-attention pooling to focus on important residues. The prediction is further improved by exploiting homology information and accounting for the overlapping communities of proteins with related functions through a label diffusion algorithm. SPROF-GO was shown to surpass state-of-the-art sequence-based and even network-based approaches by more than 14.5%, 27.3% and 10.1% in AUPR on the three sub-ontology test sets, respectively. Our method was also demonstrated to generalize well to non-homologous proteins and unseen species. Finally, visualization based on the attention mechanism indicated that SPROF-GO is able to capture sequence domains useful for function prediction.
Key points
SPROF-GO is a sequence-based protein function predictor which leverages a pretrained language model to efficiently extract informative sequence embeddings, thus bypassing expensive database searches.
SPROF-GO employs self-attention pooling to capture sequence domains useful for function prediction and provide interpretability.
SPROF-GO applies a hierarchical learning strategy to produce consistent predictions and label diffusion to exploit homology information.
SPROF-GO is accurate and robust, outperforming state-of-the-art sequence-based and even network-based approaches and generalizing well to non-homologous proteins and unseen species.
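To make two of the ideas above concrete, the following is a minimal, dependency-free sketch of attention pooling over per-residue embeddings and of a generic label diffusion step over a protein similarity graph. It is an illustration under stated assumptions, not the SPROF-GO implementation: the function names, the single scoring vector `w`, the mixing weight `alpha`, and the row-normalized update rule are all hypothetical choices made for this example, and the actual model may differ in every one of them.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(residue_embeddings, w):
    """Pool an L x d list of residue embeddings into one d-vector.

    `w` is a hypothetical learned scoring vector of length d; each residue
    gets a scalar score, and the softmax of the scores weights the average.
    Returns (pooled_vector, attention_weights).
    """
    scores = [sum(x * wi for x, wi in zip(row, w)) for row in residue_embeddings]
    weights = softmax(scores)
    dim = len(w)
    pooled = [
        sum(weights[i] * residue_embeddings[i][j] for i in range(len(weights)))
        for j in range(dim)
    ]
    return pooled, weights


def label_diffusion(initial, similarity, alpha=0.5, n_iter=100):
    """Generic label propagation (assumed form, not the paper's exact algorithm).

    initial:    N x C list of initial scores in [0, 1]
    similarity: N x N list of non-negative pairwise similarities
    Iterates Y <- alpha * S_norm * Y + (1 - alpha) * Y0, where S_norm is the
    row-normalized similarity matrix. Proteins with no neighbors keep Y0.
    """
    n, c = len(initial), len(initial[0])
    row_sums = [sum(row) for row in similarity]
    scores = [row[:] for row in initial]
    for _ in range(n_iter):
        updated = []
        for i in range(n):
            if row_sums[i] == 0:
                updated.append(initial[i][:])
                continue
            new_row = []
            for k in range(c):
                neighbor_avg = sum(
                    similarity[i][j] * scores[j][k] for j in range(n)
                ) / row_sums[i]
                new_row.append(alpha * neighbor_avg + (1 - alpha) * initial[i][k])
            updated.append(new_row)
        scores = updated
    return scores


if __name__ == "__main__":
    # Toy inputs: three residues with 2-dimensional embeddings.
    emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    pooled, attn = attention_pool(emb, w=[10.0, 0.0])
    print("attention weights:", attn)
    print("pooled embedding:", pooled)

    # Toy graph: proteins 0 and 1 are similar, protein 2 is isolated.
    y0 = [[1.0], [0.0], [0.0]]
    sim = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    print("diffused scores:", label_diffusion(y0, sim))
```

In this toy setting the attention weights indicate which residues dominate the pooled representation, which is the property that attention-based visualization relies on, and the diffusion step lets an unannotated protein inherit part of the score of a similar annotated one while an isolated protein keeps its initial prediction.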
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
Author biographies:
Qianmu Yuan is affiliated with the School of Computer Science and Engineering at Sun Yat-sen University. His research interests lie in deep learning, graph neural networks, protein structure prediction, and protein function prediction.
Junjie Xie is affiliated with the School of Computer Science and Engineering at Sun Yat-sen University. His research interests include deep learning, graph neural networks, and molecule generation.
Jiancong Xie is affiliated with the School of Computer Science and Engineering at Sun Yat-sen University. His research interests include deep learning, graph neural networks, and knowledge graphs.
Huiying Zhao is affiliated with Sun Yat-sen Memorial Hospital at Sun Yat-sen University. Her research interests include pathogenic gene analysis, protein function, and RNA function prediction.
Yuedong Yang is affiliated with the School of Computer Science and Engineering at Sun Yat-sen University. He currently focuses on integrating HPC and AI algorithms for biomedical research.