Genome analysis: Pattern search in biological macromolecules

  • Conference paper
Combinatorial Pattern Matching (CPM 1995)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 937))

Abstract

Biological sequence data analysis has become an indispensable tool for macromolecular biology and is key to any detailed understanding of the living cell. A brief survey of the biological macromolecules and their functions is given. Sequence data analysis is introduced as a basic tool for the experimental bench biologist. So far, most queries for such analyses are issued against flat files and static indices. We discuss position tree structures and their potential in sequence data analysis. The hash position tree is introduced as a persistent, dynamic data structure for pattern searches in large biological sequence databases.
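
To make the notion of a position tree concrete, the following is a minimal sketch under stated assumptions, not the hash position tree of the paper, whose construction the abstract does not describe. It uses the textbook definition: each position in a text is identified by the shortest substring starting there that occurs nowhere else, and the tree is a trie over those identifiers. The function names (`build_position_tree`, `find`), the `$` end marker, and the naive in-memory, quadratic-time construction are all assumptions chosen for illustration.

```python
def build_position_tree(text):
    """Naive position tree: internal nodes are dicts, leaves are start positions.

    Assumes `text` ends with a unique end marker '$' so that every position
    has a substring that identifies it uniquely.
    """
    if not text.endswith("$") or text.count("$") != 1:
        raise ValueError("text must end with a single '$' end marker")
    root = {}
    for i in range(len(text)):
        node, depth = root, 0
        while True:
            ch = text[i + depth]
            child = node.get(ch)
            if child is None:
                node[ch] = i  # prefix is unique: store position as a leaf
                break
            if isinstance(child, int):
                # Collision with an existing leaf j: push both leaves down
                # until the two suffixes differ.
                j = child
                node[ch] = node = {}
                depth += 1
                while text[i + depth] == text[j + depth]:
                    node[text[i + depth]] = node = {}
                    depth += 1
                node[text[i + depth]] = i
                node[text[j + depth]] = j
                break
            node, depth = child, depth + 1
    return root


def _leaves(node):
    """Collect all positions stored below a node."""
    if isinstance(node, int):
        return [node]
    found = []
    for child in node.values():
        found.extend(_leaves(child))
    return found


def find(root, text, pattern):
    """Return the sorted start positions of `pattern` in `text`."""
    node = root
    for ch in pattern:
        child = node.get(ch)
        if child is None:
            return []
        if isinstance(child, int):
            # Only one candidate position remains: verify it against the text.
            return [child] if text.startswith(pattern, child) else []
        node = child
    # Pattern exhausted at an internal node: every leaf below is an occurrence.
    return sorted(_leaves(node))


sequence = "GATTACATTAG$"
tree = build_position_tree(sequence)
print(find(tree, sequence, "TTA"))  # [2, 7]
```

Once the tree is built, a query walks at most one edge per pattern character and then either verifies a single candidate or reports every position below the node it reached. This sketch keeps everything in memory and rebuilds from scratch; it does not attempt the persistence or dynamic updates that the abstract attributes to the hash position tree.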

Editor information

Zvi Galil, Esko Ukkonen

Copyright information

© 1995 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Mewes, H.W., Heumann, K. (1995). Genome analysis: Pattern search in biological macromolecules. In: Galil, Z., Ukkonen, E. (eds) Combinatorial Pattern Matching. CPM 1995. Lecture Notes in Computer Science, vol 937. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-60044-2_48

  • DOI: https://doi.org/10.1007/3-540-60044-2_48

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-60044-2

  • Online ISBN: 978-3-540-49412-6

  • eBook Packages: Springer Book Archive
