Abstract
Consensus patterns, like motifs and tandem repeats, are highly conserved patterns with very few substitutions where no gaps are allowed. In this paper, we present a progressive hierarchical clustering technique for discovering consensus patterns in biological databases over a certain length range. This technique can discover consensus patterns with various requirements by applying a post-processing phase. The progressive nature of the hierarchical clustering algorithm makes it scalable and efficient. Experiments to discover motifs and tandem repeats on real biological databases show significant performance gain over non-progressive clustering techniques.
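To make the definition concrete: with substitutions allowed and gaps forbidden, the natural distance between two equal-length occurrences is the Hamming distance, and a cluster of occurrences can be summarised by its column-wise majority string. The following is a minimal sketch under those assumptions, not the progressive algorithm of the paper. It is a naive, non-progressive agglomerative merge, and the names `hamming`, `consensus`, `agglomerate`, and `max_subs` are hypothetical.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of substituted positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings (no gaps)")
    return sum(x != y for x, y in zip(a, b))


def consensus(windows: list[str]) -> str:
    """Column-wise most frequent symbol over equal-length windows."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*windows))


def agglomerate(windows: list[str], max_subs: int) -> list[list[str]]:
    """Naive agglomerative clustering (assumed scheme, for illustration only).

    Start with one cluster per window and repeatedly merge the pair of
    clusters whose consensus strings are closest, as long as that distance
    does not exceed max_subs.
    """
    clusters = [[w] for w in windows]
    while len(clusters) > 1:
        best = None  # (distance, i, j) of the closest mergeable pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = hamming(consensus(clusters[i]), consensus(clusters[j]))
                if d <= max_subs and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            break  # no remaining pair is within the substitution budget
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


if __name__ == "__main__":
    groups = agglomerate(["ACGT", "ACGA", "ACGT", "TTTT"], max_subs=1)
    for g in groups:
        print(consensus(g), g)
```

Recomputing every pairwise distance on each pass is what makes this version slow on large inputs; the abstract's claim is that a progressive variant avoids that cost.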
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
ElTabakh, M.Y., Aref, W.G., Ouzzani, M., Ali, M.H. (2006). Discovering Consensus Patterns in Biological Databases. In: Dalkilic, M.M., Kim, S., Yang, J. (eds) Data Mining and Bioinformatics. VDMB 2006. Lecture Notes in Computer Science, vol 4316. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11960669_15
DOI: https://doi.org/10.1007/11960669_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68970-6
Online ISBN: 978-3-540-68971-3
eBook Packages: Computer Science (R0)