Discovering Consensus Patterns in Biological Databases

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 4316)

Abstract

Consensus patterns, such as motifs and tandem repeats, are highly conserved patterns that contain very few substitutions and allow no gaps. In this paper, we present a progressive hierarchical clustering technique for discovering consensus patterns in biological databases over a given length range. By applying a post-processing phase, the technique can discover consensus patterns that meet various requirements. The progressive nature of the hierarchical clustering algorithm makes it scalable and efficient. Experiments to discover motifs and tandem repeats on real biological databases show a significant performance gain over non-progressive clustering techniques.
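
To make the definition above concrete, the following minimal Python sketch treats occurrences of a pattern as equal-length strings that differ only by substitutions (no gaps), derives a consensus by column-wise majority vote, and lists the windows of a sequence that lie within a substitution budget of a pattern. It is an illustration only, not the paper's progressive hierarchical clustering algorithm; the function names, the majority-vote rule, and the `max_subs` threshold are assumptions introduced here.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of substituted positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings (no gaps)")
    return sum(x != y for x, y in zip(a, b))


def consensus(occurrences: list[str]) -> str:
    """Column-wise majority vote over equal-length occurrences (assumed rule)."""
    length = len(occurrences[0])
    if any(len(s) != length for s in occurrences):
        raise ValueError("all occurrences must have the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*occurrences)
    )


def supporters(sequence: str, pattern: str, max_subs: int) -> list[int]:
    """Start positions of windows within max_subs substitutions of pattern."""
    k = len(pattern)
    return [
        i
        for i in range(len(sequence) - k + 1)
        if hamming(sequence[i:i + k], pattern) <= max_subs
    ]


if __name__ == "__main__":
    occurrences = ["ACGT", "ACGA", "TCGT"]
    pattern = consensus(occurrences)
    print(pattern)
    print(supporters("ACGTACGAACCT", pattern, max_subs=1))
```

Hamming distance is used because it counts substitutions only, which matches the requirement that no gaps are allowed; a gapped setting would need an edit-distance measure instead.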

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

ElTabakh, M.Y., Aref, W.G., Ouzzani, M., Ali, M.H. (2006). Discovering Consensus Patterns in Biological Databases. In: Dalkilic, M.M., Kim, S., Yang, J. (eds) Data Mining and Bioinformatics. VDMB 2006. Lecture Notes in Computer Science, vol 4316. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11960669_15

  • DOI: https://doi.org/10.1007/11960669_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-68970-6

  • Online ISBN: 978-3-540-68971-3

  • eBook Packages: Computer Science, Computer Science (R0)
