Annotated Stochastic Context Free Grammars for Analysis and Synthesis of Proteins

Sciacca, Eva; Spinella, Salvatore; Ienco, Dino; Giannini, Paola

doi:10.1007/978-3-642-20389-3_8

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6623)

Included in the following conference series:

European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics

Abstract

An important step toward understanding the main functions of a specific protein family is detecting the features that reveal how its protein chains are constituted. To this end, we treated the amino acid sequences of proteins as a formal language and built a Context-Free Grammar annotated using an n-gram Bayesian classifier. This formalism makes it possible to analyze the connection between protein chains and protein functions. To design new protein chains with the properties of the considered family, we clustered the rules of the grammar to build an Annotated Stochastic Context Free Grammar.
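
As a rough illustration of what sampling from an annotated stochastic context-free grammar can look like, the following minimal Python sketch draws a sequence from a toy grammar. It is not the authors' implementation: the grammar, its rule probabilities, the annotation labels, and the helper names `check_normalized` and `sample` are all hypothetical and chosen only for illustration. Each rule carries a probability and a label, and the sampler returns both the generated amino acid string and the labels of the rules used to derive it.

```python
import random

# Hypothetical toy grammar (NOT taken from the paper).
# Keys are nonterminals; terminals are one-letter amino acid codes.
# Each rule is (probability, right-hand side, annotation label).
GRAMMAR = {
    "S": [(0.7, ["HELIX", "TAIL"], "full"), (0.3, ["HELIX"], "short")],
    "HELIX": [(0.6, ["G", "L", "F"], "motif_a"), (0.4, ["F", "L", "P"], "motif_b")],
    "TAIL": [(0.5, ["K", "K"], "cationic"), (0.5, ["C", "K", "C"], "loop")],
}


def check_normalized(grammar, tol=1e-9):
    """Raise ValueError unless the rule probabilities of each nonterminal sum to 1."""
    for lhs, rules in grammar.items():
        total = sum(prob for prob, _, _ in rules)
        if abs(total - 1.0) > tol:
            raise ValueError(f"probabilities for {lhs} sum to {total}")


def sample(grammar, symbol, rng):
    """Expand `symbol` top-down; return (sequence, labels of the rules used)."""
    if symbol not in grammar:
        # Any symbol without rules is treated as a terminal.
        return symbol, []
    rules = grammar[symbol]
    weights = [prob for prob, _, _ in rules]
    _, rhs, label = rng.choices(rules, weights=weights, k=1)[0]
    sequence, labels = "", [label]
    for child in rhs:
        sub_sequence, sub_labels = sample(grammar, child, rng)
        sequence += sub_sequence
        labels.extend(sub_labels)
    return sequence, labels


check_normalized(GRAMMAR)
rng = random.Random(0)  # fixed seed so repeated runs give the same draw
sequence, labels = sample(GRAMMAR, "S", rng)
print(sequence, labels)
```

In this sketch the annotation is just a string attached to each rule; in the approach described above the annotations would come from the n-gram Bayesian classifier and the probabilities from the rule clustering step.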

Our methodology was applied to a class of Antimicrobial Peptides (AmPs): the Frog antimicrobial peptides family. Through this case study, our approach highlighted important aspects of the relationship between sequences and functional domains of proteins, and of how protein domain motifs are preserved by natural evolution in the amino acid sequences. Moreover, our results suggest that the synthesis of new proteins with a given domain architecture is one of the fields where Annotated Stochastic Context Free Grammars can be useful.

This research is funded by the BioBITs Project (Converging Technologies 2007, area: Biotechnology-ICT).

Author information

Authors and Affiliations

Dipartimento di Informatica, Università di Torino, Corso Svizzera 185, I-10149, Torino, Italy
Eva Sciacca, Salvatore Spinella, Dino Ienco & Paola Giannini
Dipartimento di Informatica, Università del Piemonte Orientale, Via Bellini 25/G, 15100, Alessandria, Italy
Paola Giannini

Editor information

Editors and Affiliations

Institute for High-Performance Computing and Networking (ICAR), Italian National Research Council (CNR), Via P. Bucci 41C, 87036 Rende, (CS), Italy
Clara Pizzuti
Center for Human Genetics Research, Vanderbilt University, 519 Light Hall, TN 37232, Nashville, USA
Marylyn D. Ritchie
Department of Animal Production Epidemiology and Ecology, University of Torino, Via Leonardo da Vinci 44, 10095, Grugliasco, (TO), Italy
Mario Giacobini

Cite this paper

Sciacca, E., Spinella, S., Ienco, D., Giannini, P. (2011). Annotated Stochastic Context Free Grammars for Analysis and Synthesis of Proteins. In: Pizzuti, C., Ritchie, M.D., Giacobini, M. (eds) Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics. EvoBIO 2011. Lecture Notes in Computer Science, vol 6623. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20389-3_8

DOI: https://doi.org/10.1007/978-3-642-20389-3_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20388-6
Online ISBN: 978-3-642-20389-3
eBook Packages: Computer Science, Computer Science (R0)