Detecting Motifs in a Large Data Set: Applying Probabilistic Insights to Motif Finding

Boucher, Christina; Brown, Daniel G.

doi:10.1007/978-3-642-00727-9_15

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 5462)

Abstract

We give a probabilistic algorithm for Consensus Sequence, an NP-complete subproblem of motif recognition, that can be described as follows: given a set of l-length sequences, determine whether there exists a sequence that has Hamming distance at most d from every sequence in the set. We demonstrate that the distance between a randomly selected majority sequence and a consensus sequence decreases as the size of the data set increases. Applying our probabilistic paradigms and insights to motif recognition, we develop pMCL-WMR, a program capable of detecting motifs in large synthetic and real genomic data sets. Our results show that detecting motifs becomes easier and more efficient as the number of sequences in the data set increases, a surprising and counter-intuitive fact that has significant impact on this deeply investigated area.
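
To make the problem statement concrete, here is a minimal Python sketch of the Consensus Sequence decision problem and of a majority sequence. It is an illustration only and is not the paper's probabilistic algorithm or pMCL-WMR: the function names, the DNA alphabet "ACGT", the exhaustive search, and the alphabetical tie-breaking in the majority sequence are all assumptions made for this example (the abstract speaks of a randomly selected majority sequence, which would break ties at random).

```python
from collections import Counter
from itertools import product
from typing import Optional, Sequence


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def majority_sequence(seqs: Sequence[str]) -> str:
    """Take a most frequent symbol in each column.

    Ties are broken alphabetically here (an assumption for determinism).
    """
    out = []
    for column in zip(*seqs):
        counts = Counter(column)
        best = max(counts.values())
        out.append(min(sym for sym, c in counts.items() if c == best))
    return "".join(out)


def find_consensus(seqs: Sequence[str], d: int,
                   alphabet: str = "ACGT") -> Optional[str]:
    """Return a sequence within Hamming distance d of every input, or None.

    Exhaustive search over all |alphabet|^l candidates: exponential in l,
    so suitable only for tiny illustrative inputs.
    """
    length = len(seqs[0])
    for candidate in product(alphabet, repeat=length):
        s = "".join(candidate)
        if all(hamming(s, t) <= d for t in seqs):
            return s
    return None


if __name__ == "__main__":
    seqs = ["ACGT", "ACGA", "ACCT", "TCGT"]
    print(majority_sequence(seqs))   # ACGT
    print(find_consensus(seqs, 1))   # ACGT
    print(find_consensus(seqs, 0))   # None
```

In this toy input the majority sequence happens to be a consensus sequence for d = 1, which is the kind of closeness between the two that the abstract says becomes more likely as the number of sequences grows.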

Author information

Authors and Affiliations

David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada, N2L 3G1
Christina Boucher & Daniel G. Brown

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, University of Connecticut, 257 ITE Building, 371 Fairfield Way, CT 06269-2155, Storrs, USA
Sanguthevar Rajasekaran

About this paper

Cite this paper

Boucher, C., Brown, D.G. (2009). Detecting Motifs in a Large Data Set: Applying Probabilistic Insights to Motif Finding. In: Rajasekaran, S. (ed.) Bioinformatics and Computational Biology. BICoB 2009. Lecture Notes in Computer Science, vol 5462. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00727-9_15

DOI: https://doi.org/10.1007/978-3-642-00727-9_15
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00726-2
Online ISBN: 978-3-642-00727-9
eBook Packages: Computer Science, Computer Science (R0)
