ABSTRACT
We present GNNfam, a pipeline for predicting protein families from protein sequences. GNNfam aligns proteins with the pairwise sequence aligner LAST, builds a sparse graph from the alignment scores, and applies graph neural networks (GNNs) to predict protein families. Unlike alignment-free deep learning methods such as DeepFam, GNNfam can control the sparsity of the protein similarity graph and prune uninformative edges. We develop three pruning strategies that improve the prediction accuracy, convergence, and running time of the downstream GNNs. We also demonstrate that semi-supervised GNNs outperform traditional graph clustering-based methods by a large margin. Trained on three labeled sequence datasets from the SCOPe and COG databases, GNNfam achieves more than 90% test accuracy in predicting protein families and performs significantly better than clustering, embedding, and other deep learning methods. GNNfam is available at https://github.com/HipGraph/GNNfam.
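The graph-construction and pruning steps described above can be illustrated with a minimal sketch. It assumes that alignment hits arrive as (query, target, score) triples and uses a top-k rule per node as the pruning criterion. The abstract does not name its three pruning strategies, so the top-k rule, the function names `build_graph` and `prune_top_k`, and the toy scores are all assumptions made for illustration and are not taken from GNNfam.

```python
from collections import defaultdict


def build_graph(hits):
    """Collapse pairwise alignment hits into an undirected weighted edge map."""
    edges = {}
    for a, b, score in hits:
        if a == b:
            continue  # ignore self-alignments
        key = (a, b) if a < b else (b, a)
        # keep the best score seen for each unordered pair
        if key not in edges or score > edges[key]:
            edges[key] = score
    return edges


def prune_top_k(edges, k):
    """Keep an edge if it ranks among the k best edges of either endpoint.

    Hypothetical pruning rule, chosen only to show how sparsity can be tuned.
    """
    incident = defaultdict(list)
    for (a, b), score in edges.items():
        incident[a].append((score, (a, b)))
        incident[b].append((score, (a, b)))
    kept = set()
    for items in incident.values():
        # highest score first; ties broken by edge name for determinism
        items.sort(key=lambda t: (-t[0], t[1]))
        kept.update(edge for _, edge in items[:k])
    return {edge: edges[edge] for edge in sorted(kept)}


# Toy alignment hits (made-up scores, for illustration only).
hits = [
    ("p1", "p2", 310.0),
    ("p2", "p1", 305.0),
    ("p1", "p3", 42.0),
    ("p2", "p3", 55.0),
    ("p3", "p4", 280.0),
    ("p1", "p4", 30.0),
]
graph = build_graph(hits)
pruned = prune_top_k(graph, k=1)
print(len(graph), "edges before pruning,", len(pruned), "after")
```

Raising `k` keeps more low-scoring edges and makes the graph denser. A smaller `k` gives a sparser graph for a downstream GNN to process, at the risk of dropping informative edges.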
Index Terms
GNNfam: utilizing sparsity in protein family predictions using graph neural networks