Discovering Consensus Patterns in Biological Databases

ElTabakh, Mohamed Y.; Aref, Walid G.; Ouzzani, Mourad; Ali, Mohamed H.

doi:10.1007/11960669_15

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 4316)

Abstract

Consensus patterns, such as motifs and tandem repeats, are highly conserved patterns that contain very few substitutions and allow no gaps. In this paper, we present a progressive hierarchical clustering technique for discovering consensus patterns within a given length range in biological databases. A post-processing phase adapts the technique to consensus patterns with different requirements. The progressive nature of the hierarchical clustering algorithm makes it scalable and efficient. Experiments on discovering motifs and tandem repeats in real biological databases show a significant performance gain over non-progressive clustering techniques.
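
The notion of a gap-free consensus pattern with few substitutions can be illustrated with a small sketch. The code below is not the progressive algorithm of the paper, whose details are not given in the abstract. It is a plain, non-progressive single-linkage grouping of fixed-length windows under Hamming distance, implemented with union-find, followed by a column-wise majority vote. The names `windows_of`, `consensus_patterns`, `max_subs`, and `min_support` are hypothetical and chosen only for this illustration.

```python
from collections import Counter
from itertools import combinations


def hamming(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def windows_of(sequence, length):
    """Return every contiguous, gap-free window of the given length."""
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]


def consensus(members):
    """Build the column-wise majority string of equal-length strings."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*members))


def _find(parent, i):
    """Union-find root lookup with path halving."""
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def consensus_patterns(windows, max_subs, min_support=2):
    """Group windows by single linkage and report one consensus per group.

    Two windows are linked when they differ in at most `max_subs`
    positions. Groups with fewer than `min_support` members are dropped.
    This is an illustrative assumption, not the method of the paper.
    """
    parent = list(range(len(windows)))
    for i, j in combinations(range(len(windows)), 2):
        if hamming(windows[i], windows[j]) <= max_subs:
            parent[_find(parent, i)] = _find(parent, j)

    groups = {}
    for i, window in enumerate(windows):
        groups.setdefault(_find(parent, i), []).append(window)

    return [(consensus(g), len(g)) for g in groups.values()
            if len(g) >= min_support]


if __name__ == "__main__":
    sample = ["ACGT", "ACGA", "ACGT", "TTTT", "TTTA", "GGCC"]
    print(consensus_patterns(sample, max_subs=1, min_support=3))
```

Comparing all pairs of windows costs quadratic time in the number of windows, which is the kind of expense a progressive scheme would aim to reduce. Single linkage can also chain distant windows together through intermediate ones, so a stricter linkage criterion may be preferable in practice.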

Author information

Authors and Affiliations

Dept. of Computer Science, Purdue University, West Lafayette, IN, 47906, USA
Mohamed Y. ElTabakh, Walid G. Aref & Mohamed H. Ali
Cyber Center, Purdue University, West Lafayette, IN, 47906, USA
Mourad Ouzzani

Editor information

Editors and Affiliations

School of Informatics, Indiana University, 901 E. 10th Street, 47408, Bloomington, IN, USA
Mehmet M. Dalkilic & Sun Kim
EECS Department, Case Western Reserve Univ., 10900 Euclid Ave, 44106, Cleveland, OH, USA
Jiong Yang

About this paper

Cite this paper

ElTabakh, M.Y., Aref, W.G., Ouzzani, M., Ali, M.H. (2006). Discovering Consensus Patterns in Biological Databases. In: Dalkilic, M.M., Kim, S., Yang, J. (eds) Data Mining and Bioinformatics. VDMB 2006. Lecture Notes in Computer Science, vol 4316. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11960669_15

DOI: https://doi.org/10.1007/11960669_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-68970-6
Online ISBN: 978-3-540-68971-3
eBook Packages: Computer Science, Computer Science (R0)
