Genome analysis: Pattern search in biological macromolecules

Mewes, H. W.; Heumann, K.

doi:10.1007/3-540-60044-2_48

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 937)

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching

Abstract

Biological sequence data analysis has developed into an indispensable tool for macromolecular biology and is key to any detailed understanding of the living cell. A brief survey of the biological macromolecules and their functions is given. Sequence data analysis is introduced as a basic tool for the experimental bench biologist. So far, most queries for such analyses are issued against flat files and static indices. We discuss position tree structures and their potential in sequence data analysis. The hash position tree is introduced as a persistent, dynamic data structure for pattern searches in large biological sequence databases.
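
The abstract does not describe the internals of the hash position tree, so the following is only a rough illustration of the general idea behind position-indexed pattern search, not the authors' data structure. It is a minimal sketch that assumes a depth-bounded trie whose child links are hash maps and whose nodes record the start positions of the substrings that reach them. The names `PositionIndex`, `max_depth`, and `find` are hypothetical, and the sketch is held in memory only, so it does not address persistence.

```python
class _Node:
    """One trie node: hashed child links plus the start positions seen here."""
    __slots__ = ("children", "positions")

    def __init__(self):
        self.children = {}    # hash map: symbol -> child node
        self.positions = []   # start positions of substrings reaching this node


class PositionIndex:
    """Illustrative depth-bounded position index (assumption, not the paper's design)."""

    def __init__(self, sequence, max_depth=8):
        self.sequence = sequence
        self.max_depth = max_depth
        self.root = _Node()
        for start in range(len(sequence)):
            self._insert(start)

    def _insert(self, start):
        # Walk at most max_depth symbols from this start position,
        # recording the position at every node along the path.
        node = self.root
        for symbol in self.sequence[start:start + self.max_depth]:
            node = node.children.setdefault(symbol, _Node())
            node.positions.append(start)

    def find(self, pattern):
        """Return all start positions at which pattern occurs, in ascending order."""
        if not pattern:
            return []
        node = self.root
        for symbol in pattern[:self.max_depth]:
            node = node.children.get(symbol)
            if node is None:
                return []
        # Patterns longer than max_depth are confirmed against the raw sequence.
        return [p for p in node.positions if self.sequence.startswith(pattern, p)]


index = PositionIndex("GATTACAGATTACA", max_depth=4)
print(index.find("ATTA"))     # pattern within the indexed depth
print(index.find("GATTACA"))  # longer pattern, verified against the sequence
```

Bounding the depth keeps the index size roughly proportional to the sequence length times `max_depth`, at the cost of a verification step for longer patterns. A static flat-file scan, by contrast, would have to read the whole sequence for every query.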


Author information

Authors and Affiliations

Max-Planck-Inst. f. Biochemie, 82152, Martinsried, Germany
H. W. Mewes & K. Heumann

Editor information

Zvi Galil, Esko Ukkonen

About this paper

Cite this paper

Mewes, H.W., Heumann, K. (1995). Genome analysis: Pattern search in biological macromolecules. In: Galil, Z., Ukkonen, E. (eds) Combinatorial Pattern Matching. CPM 1995. Lecture Notes in Computer Science, vol 937. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-60044-2_48

DOI: https://doi.org/10.1007/3-540-60044-2_48
Published: 31 May 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-60044-2
Online ISBN: 978-3-540-49412-6
eBook Packages: Springer Book Archive