Annotated Stochastic Context Free Grammars for Analysis and Synthesis of Proteins

  • Conference paper
Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics (EvoBIO 2011)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 6623)

Abstract

An important step in understanding the main functions of a specific protein family is detecting features that reveal how its protein chains are constituted. To this end, we treat the amino acid sequences of proteins as a formal language and build a Context-Free Grammar annotated by an n-gram Bayesian classifier. This formalism makes it possible to analyze the connection between protein chains and protein functions. To design new protein chains with the properties of the family under consideration, we cluster the rules of the grammar to build an Annotated Stochastic Context Free Grammar.
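As a rough illustration of how a stochastic context-free grammar can be used to synthesize sequences, the sketch below samples strings from a toy grammar over one-letter amino acid codes. The grammar, its rule probabilities, and the names `TOY_SCFG`, `check_normalized`, and `sample` are all hypothetical; they are not taken from the paper, and the annotation and rule-clustering steps described above are not modeled here.

```python
import random

# Hypothetical toy grammar (NOT from the paper). Keys are nonterminals;
# every symbol that is not a key is treated as a terminal, here a
# one-letter amino acid code. Each rule is (probability, right-hand side).
TOY_SCFG = {
    "START": [(1.0, ["HEAD", "TAIL"])],
    "HEAD": [(0.6, ["G", "L"]), (0.4, ["F", "L", "P"])],
    "TAIL": [(0.5, ["K", "TAIL"]), (0.5, ["C"])],
}


def check_normalized(grammar, tol=1e-9):
    """Return True if the rule probabilities of every nonterminal sum to 1."""
    return all(
        abs(sum(p for p, _ in rules) - 1.0) <= tol
        for rules in grammar.values()
    )


def sample(grammar, symbol="START", rng=None, max_depth=50):
    """Expand `symbol` top-down, choosing each rule by its probability."""
    if rng is None:
        rng = random.Random()
    if symbol not in grammar:
        return symbol  # terminal: emit it unchanged
    if max_depth == 0:
        raise RecursionError("derivation exceeded max_depth")
    rules = grammar[symbol]
    weights = [p for p, _ in rules]
    _, rhs = rng.choices(rules, weights=weights, k=1)[0]
    return "".join(sample(grammar, s, rng, max_depth - 1) for s in rhs)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    for _ in range(5):
        print(sample(TOY_SCFG, rng=rng))
```

Every string generated by this toy grammar starts with `GL` or `FLP`, continues with zero or more `K` residues, and ends with `C`. In a realistic setting the rules and their probabilities would be estimated from a protein family rather than written by hand.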

Our methodology was applied to a class of Antimicrobial Peptides (AmPs): the frog antimicrobial peptide family. In this case study, our approach highlighted important aspects of the relationship between sequences and functional domains of proteins, and of how protein domain motifs are preserved in amino acid sequences by natural evolution. Moreover, our results suggest that synthesizing new proteins with a given domain architecture is one field where Annotated Stochastic Context Free Grammars can be useful.

This research is funded by the BioBITs Project (Converging Technologies 2007, area: Biotechnology-ICT).

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sciacca, E., Spinella, S., Ienco, D., Giannini, P. (2011). Annotated Stochastic Context Free Grammars for Analysis and Synthesis of Proteins. In: Pizzuti, C., Ritchie, M.D., Giacobini, M. (eds) Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics. EvoBIO 2011. Lecture Notes in Computer Science, vol 6623. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20389-3_8

  • DOI: https://doi.org/10.1007/978-3-642-20389-3_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20388-6

  • Online ISBN: 978-3-642-20389-3

  • eBook Packages: Computer Science, Computer Science (R0)
