Unsupervised Grammar Induction for Revealing the Internal Structure of Protein Sequence Motifs

  • Conference paper
Artificial Intelligence in Medicine (AIME 2020)

Abstract

Protein sequence motifs are conserved amino acid patterns of biological significance. They are vital for annotating structural and functional features of proteins. Yet the computational methods commonly used to define sequence motifs typically rely on simplified linear representations that neglect the higher-order structure of the motif. The purpose of this work is to create models of sequence motifs that take into account the internal structure of the modeled fragments. The ultimate goal is to provide the community with accurate and concise models of diverse collections of remotely related amino acid sequences that share structural features. The internal structure of amino acid sequences is modeled using a novel algorithm for unsupervised learning of a weighted context-free grammar (WCFG). The proposed method learns the WCFG from both positive and negative samples, and the rule weights are estimated using a novel Inside-Outside Contrastive Estimation algorithm. Compared with existing approaches to learning context-free grammars (CFGs), the new method generates more concise descriptors and provides good control of the trade-off between grammar size and specificity. The method is applied to the nicotinamide adenine dinucleotide phosphate binding site motif.
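
To make the notion of a WCFG scored on positive and negative samples more concrete, the following is a minimal sketch, not the authors' algorithm. It computes the inside weight (the summed weight of all parses) of a sequence under a toy grammar in Chomsky normal form, and then a simple illustrative contrast between positive and negative samples. The grammar, its weights, the sample strings, and the function names `inside_weight` and `contrastive_score` are all assumptions made for illustration; the paper's Inside-Outside Contrastive Estimation procedure is not reproduced here.

```python
from collections import defaultdict


def inside_weight(seq, lexical, binary, start="S"):
    """Summed weight of all parses of seq under a WCFG in Chomsky normal form.

    lexical: dict mapping (nonterminal, terminal) -> weight
    binary:  dict mapping (parent, left, right) -> weight
    """
    n = len(seq)
    if n == 0:
        return 0.0
    # chart[i][j][A] = summed weight of derivations of seq[i:j] from A
    chart = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    # Width-1 spans are filled from the lexical rules.
    for i, symbol in enumerate(seq):
        for (nt, terminal), w in lexical.items():
            if terminal == symbol:
                chart[i][i + 1][nt] += w
    # Wider spans combine two adjacent sub-spans through a binary rule.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                for (parent, left, right), w in binary.items():
                    lw = chart[i][k].get(left, 0.0)
                    rw = chart[k][j].get(right, 0.0)
                    if lw and rw:
                        chart[i][j][parent] += w * lw * rw
    return chart[0][n].get(start, 0.0)


def contrastive_score(positives, negatives, lexical, binary):
    """Illustrative contrast (an assumption, not the paper's objective):
    the share of total inside weight that falls on the positive samples."""
    pos = sum(inside_weight(s, lexical, binary) for s in positives)
    neg = sum(inside_weight(s, lexical, binary) for s in negatives)
    total = pos + neg
    return pos / total if total > 0 else 0.0


# Hypothetical toy grammar over a two-letter alphabet (G, A).
LEXICAL = {("X", "G"): 0.6, ("X", "A"): 0.4, ("Y", "G"): 1.0}
BINARY = {("S", "X", "Y"): 0.5, ("S", "X", "S"): 0.5}

if __name__ == "__main__":
    for s in ["GG", "AG", "GA", "AGG"]:
        print(s, inside_weight(s, LEXICAL, BINARY))
    print(contrastive_score(["GG", "AG"], ["GA", "AGG"], LEXICAL, BINARY))
```

In this sketch a grammar that assigns most of its weight to positive samples obtains a score close to 1, which is one simple way to picture the trade-off between coverage of positives and rejection of negatives.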

Acknowledgements

The research was supported by the National Science Centre Poland (NCN), project registration no. 2016/21/B/ST6/02158.

Author information

Corresponding author

Correspondence to Olgierd Unold.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Unold, O., Gabor, M., Dyrka, W. (2020). Unsupervised Grammar Induction for Revealing the Internal Structure of Protein Sequence Motifs. In: Michalowski, M., Moskovitch, R. (eds) Artificial Intelligence in Medicine. AIME 2020. Lecture Notes in Computer Science, vol 12299. Springer, Cham. https://doi.org/10.1007/978-3-030-59137-3_27

  • DOI: https://doi.org/10.1007/978-3-030-59137-3_27

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-59136-6

  • Online ISBN: 978-3-030-59137-3

  • eBook Packages: Computer Science, Computer Science (R0)
