Abstract
Protein sequence motifs are conserved amino acid patterns of biological significance and are vital for annotating structural and functional features of proteins. Yet the computational methods commonly used to define sequence motifs rely on simplified linear representations that neglect the higher-order structure of the motif. The purpose of this work is to create models of sequence motifs that take into account the internal structure of the modeled fragments. The ultimate goal is to provide the community with accurate and concise models of diverse collections of remotely related amino acid sequences that share structural features. The internal structure of amino acid sequences is modeled using a novel algorithm for unsupervised learning of weighted context-free grammars (WCFGs). The proposed method learns a WCFG from both positive and negative samples, and the rule weights are estimated using a novel Inside-Outside Contrastive Estimation algorithm. Compared with existing approaches to learning CFGs, the new method generates more concise descriptors and provides good control of the trade-off between grammar size and specificity. The method is applied to the nicotinamide adenine dinucleotide phosphate binding site motif.
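To make the notion of a weighted context-free grammar concrete, the following is a minimal sketch of the standard inside computation, which sums the weights of all derivations of a sequence under a WCFG in Chomsky normal form. It is not the chapter's learning or estimation algorithm. The function name `inside_weight`, the toy grammar, and its weights are hypothetical, chosen only to show how a grammar can assign non-zero weight to sequences with nested dependencies and zero weight to others.

```python
from collections import defaultdict


def inside_weight(seq, lexical, binary, start="S"):
    """Sum of the weights of all derivations of seq from the start symbol.

    The grammar is assumed to be in Chomsky normal form:
      lexical: {(A, terminal): weight} for rules A -> terminal
      binary:  {(A, B, C): weight}     for rules A -> B C
    """
    n = len(seq)
    if n == 0:
        return 0.0
    # chart[i][j][A] holds the summed weight of derivations of seq[i:j] from A.
    chart = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    # Spans of length 1 are covered by lexical rules.
    for i, symbol in enumerate(seq):
        for (a, terminal), w in lexical.items():
            if terminal == symbol:
                chart[i][i + 1][a] += w
    # Longer spans combine two adjacent sub-spans through a binary rule.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):
                left, right = chart[i][k], chart[k][j]
                if not left or not right:
                    continue
                for (a, b, c), w in binary.items():
                    if b in left and c in right:
                        chart[i][j][a] += w * left[b] * right[c]
    return chart[0][n].get(start, 0.0)


# Hypothetical toy grammar over two residues (G = glycine, A = alanine).
# It accepts nested pairs G^n A^n: S -> G A | G T, T -> S A.
LEXICAL = {("G", "G"): 1.0, ("A", "A"): 1.0}
BINARY = {("S", "G", "A"): 0.5, ("S", "G", "T"): 0.5, ("T", "S", "A"): 1.0}

if __name__ == "__main__":
    for sample in ("GA", "GGAA", "GAGA"):
        print(sample, inside_weight(sample, LEXICAL, BINARY))
```

In a contrastive setting, one would expect good rule weights to yield high inside weights for positive samples and low ones for negative samples. Here the nested sequence "GGAA" receives non-zero weight while "GAGA" receives none, which a purely linear pattern over single positions could not distinguish in general.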
Acknowledgements
The research was supported by the National Science Centre Poland (NCN), project registration no. 2016/21/B/ST6/02158.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Unold, O., Gabor, M., Dyrka, W. (2020). Unsupervised Grammar Induction for Revealing the Internal Structure of Protein Sequence Motifs. In: Michalowski, M., Moskovitch, R. (eds) Artificial Intelligence in Medicine. AIME 2020. Lecture Notes in Computer Science, vol 12299. Springer, Cham. https://doi.org/10.1007/978-3-030-59137-3_27
DOI: https://doi.org/10.1007/978-3-030-59137-3_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-59136-6
Online ISBN: 978-3-030-59137-3
eBook Packages: Computer Science, Computer Science (R0)