DOI: 10.1145/3459930.3469538
research-article

GNNfam: utilizing sparsity in protein family predictions using graph neural networks

Published: 01 August 2021

ABSTRACT

We present GNNfam, a pipeline for predicting protein families from protein sequences. GNNfam aligns proteins using the pairwise sequence aligner LAST, creates a sparse graph based on the alignment scores, and employs graph neural networks (GNNs) to predict protein families. Unlike alignment-free deep learning methods such as DeepFam, GNNfam can control the sparsity of the protein similarity graph to prune uninformative edges. We develop three pruning strategies to improve the prediction accuracy, convergence, and running time of the downstream graph neural networks. We also demonstrate that semi-supervised GNNs outperform traditional graph clustering-based methods by a large margin. When trained with three labeled sequence datasets from the SCOPe and COG databases, GNNfam achieves more than 90% test accuracy when predicting protein families and performs significantly better than clustering, embedding, and other deep learning methods. GNNfam is available at https://github.com/HipGraph/GNNfam.
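
The abstract does not describe the three pruning strategies in detail, so the following is only an illustrative sketch of one plausible way to control the sparsity of a similarity graph: keep an edge if it is among the k highest-scoring alignment edges of at least one endpoint. The function name `prune_top_k`, the `(u, v, score)` edge layout, and the toy scores are assumptions made for this example and are not taken from the GNNfam code.

```python
from collections import defaultdict


def prune_top_k(edges, k):
    """Hypothetical top-k pruning of an undirected similarity graph.

    `edges` is a list of (u, v, score) tuples, assumed to list each
    undirected edge once. An edge survives if it is among the k
    highest-scoring edges incident to at least one of its endpoints.
    """
    by_node = defaultdict(list)
    for u, v, score in edges:
        by_node[u].append((score, u, v))
        by_node[v].append((score, u, v))

    kept = set()
    for incident in by_node.values():
        # Highest score first; ties are broken by endpoint names
        # so the result is deterministic.
        incident.sort(key=lambda e: (-e[0], e[1], e[2]))
        for score, u, v in incident[:k]:
            kept.add((u, v, score))
    return sorted(kept)


# Toy alignment scores (made up for illustration).
edges = [
    ("p1", "p2", 310.0),
    ("p1", "p3", 42.0),
    ("p1", "p4", 35.0),
    ("p2", "p3", 280.0),
    ("p3", "p4", 55.0),
]
pruned = prune_top_k(edges, k=1)
print(pruned)
```

With k = 1 the two weakest edges of `p1` are dropped, while every protein still keeps its best-scoring neighbor, so no node becomes isolated. A global score threshold would be a simpler alternative, though it can disconnect proteins that have only weak alignments.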

References

  1. A. Azad, G. A. Pavlopoulos, C. A. Ouzounis, N. C. Kyrpides, and A. Buluç. HipMCL: A high-performance parallel implementation of the Markov clustering algorithm for large-scale networks. Nucleic Acids Research, 46(6):e33, 2018.
  2. G. D. Bader and C. W. Hogue. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics, 4(1):1–27, 2003.
  3. M. L. Bileschi, D. Belanger, D. Bryant, T. Sanderson, B. Carter, D. Sculley, M. A. DePristo, and L. J. Colwell. Using deep learning to annotate the protein universe. bioRxiv, page 626507, 2019.
  4. B. Buchfink, C. Xie, and D. H. Huson. Fast and sensitive protein alignment using DIAMOND. Nature Methods, 12(1):59–60, 2015.
  5. J.-M. Chandonia, N. K. Fox, and S. E. Brenner. SCOPe: manual curation and artifact removal in the structural classification of proteins-extended database. Journal of Molecular Biology, 429(3):348–355, 2017.
  6. R. C. Edgar. Search and clustering orders of magnitude faster than BLAST. Bioinformatics, 26(19):2460–2461, 2010.
  7. A. J. Enright, S. Van Dongen, and C. A. Ouzounis. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Research, 30(7):1575–1584, 2002.
  8. M. Fey and J. E. Lenssen. Fast graph representation learning with PyTorch Geometric. In ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
  9. M. Y. Galperin, K. S. Makarova, Y. I. Wolf, and E. V. Koonin. Expanded microbial genome coverage and improved protein family annotation in the COG database. Nucleic Acids Research, 43(D1):D261–D269, 2015.
  10. Z. Gao, G. Fu, C. Ouyang, S. Tsutsui, X. Liu, J. Yang, C. Gessner, B. Foote, D. Wild, Y. Ding, et al. edge2vec: Representation learning using edge semantics for biomedical knowledge discovery. BMC Bioinformatics, 20(1):1–15, 2019.
  11. P. Goyal and E. Ferrara. Graph embedding techniques, applications, and performance: A survey. Knowledge-Based Systems, 151:78–94, 2018.
  12. W. L. Hamilton, R. Ying, and J. Leskovec. Inductive representation learning on large graphs. In NIPS, 2017.
  13. J. Hong, Y. Luo, Y. Zhang, J. Ying, W. Xue, T. Xie, L. Tao, and F. Zhu. Protein functional annotation of simultaneously improved stability, accuracy and false discovery rate achieved by a sequence-based deep learning. Briefings in Bioinformatics, 21(4):1437–1447, 2020.
  14. P. Jiang and M. Singh. SPICi: a fast clustering algorithm for large biological networks. Bioinformatics, 26(8):1105–1111, 2010.
  15. S. M. Kielbasa, R. Wan, K. Sato, P. Horton, and M. C. Frith. Adaptive seeds tame genomic sequence comparison. Genome Research, 21(3):487–493, 2011.
  16. D. P. Kingma and J. Ba. Adam: A method for stochastic optimization, 2017.
  17. T. N. Kipf and M. Welling. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907, 2016.
  18. W. Li and A. Godzik. CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics, 22(13):1658–1659, 2006.
  19. X. Lin, Z. Quan, Z.-J. Wang, T. Ma, and X. Zeng. KGNN: Knowledge graph neural network for drug-drug interaction prediction. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20 (International Joint Conferences on Artificial Intelligence Organization), pages 2739–2745, 2020.
  20. T. Mikolov, I. Sutskever, K. Chen, G. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546, 2013.
  21. C. Morris, M. Ritzert, M. Fey, W. L. Hamilton, J. E. Lenssen, G. Rattan, and M. Grohe. Weisfeiler and Leman go neural: Higher-order graph neural networks, 2020.
  22. B. Perozzi, R. Al-Rfou, and S. Skiena. DeepWalk: Online learning of social representations. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 701–710, 2014.
  23. S. C. Potter, A. Luciani, S. R. Eddy, Y. Park, R. Lopez, and R. D. Finn. HMMER web server: 2018 update. Nucleic Acids Research, 46(W1):W200–W204, 2018.
  24. O. Selvitopi, S. Ekanayake, G. Guidi, G. Pavlopoulos, A. Azad, and A. Buluç. Distributed many-to-many protein sequence alignment using sparse matrices. In SC20, International Conference for High Performance Computing, Networking, Storage and Analysis, 2020.
  25. S. Seo, M. Oh, Y. Park, and S. Kim. DeepFam: deep learning based alignment-free method for protein family modeling and prediction. Bioinformatics, 34(13):i254–i262, 2018.
  26. M. Steinegger and J. Söding. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35(11):1026–1028, 2017.
  27. M. Steinegger and J. Söding. Clustering huge protein sequence sets in linear time. Nature Communications, 9(1):1–8, 2018.
  28. R. L. Tatusov, E. V. Koonin, and D. J. Lipman. A genomic perspective on protein families. Science, 278(5338):631–637, 1997.
  29. S. van Dongen. Graph clustering by flow simulation. PhD thesis, Utrecht University, 2000.
  30. Z. Xiao and Y. Deng. Graph embedding-based novel protein interaction prediction via higher-order graph convolutional network. PLoS ONE, 15(9):e0238915, 2020.
  31. K. Xu, W. Hu, J. Leskovec, and S. Jegelka. How powerful are graph neural networks? Proceedings of ICLR, 2019.
  32. R. Ying, J. You, C. Morris, X. Ren, W. L. Hamilton, and J. Leskovec. Hierarchical graph representation learning with differentiable pooling. Proceedings of NeurIPS, 2018.
  33. X. Yue, Z. Wang, J. Huang, S. Parthasarathy, S. Moosavinasab, Y. Huang, S. M. Lin, W. Zhang, P. Zhang, and H. Sun. Graph embedding on biomedical networks: methods, applications and evaluations. Bioinformatics, 36(4):1241–1251, 2020.
  34. D. Zhang and M. Kabuka. Multimodal deep representation learning for protein interaction identification and protein family classification. BMC Bioinformatics, 20(16):1–14, 2019.
  35. D. Zhang and M. R. Kabuka. Protein family classification with multi-layer graph convolutional networks. In 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 2390–2393. IEEE, 2018.
  36. F. Zhang, H. Song, M. Zeng, Y. Li, L. Kurgan, and M. Li. DeepFunc: a deep learning framework for accurate prediction of protein functions from protein sequences and interactions. Proteomics, 19(12):1900019, 2019.
  37. A. Zielezinski, S. Vinga, J. Almeida, and W. M. Karlowski. Alignment-free sequence comparison: benefits, applications, and tools. Genome Biology, 18(1):1–17, 2017.

Published in

BCB '21: Proceedings of the 12th ACM Conference on Bioinformatics, Computational Biology, and Health Informatics
August 2021
603 pages
ISBN: 9781450384506
DOI: 10.1145/3459930

Copyright © 2021 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 254 of 885 submissions, 29%