Discovering Almost Any Hidden Motif from Multiple Sequences in Polynomial Time with Low Sample Complexity and High Success Probability

Fu, Bin; Kao, Ming-Yang; Wang, Lusheng

doi:10.1007/978-3-642-02017-9_26


Conference paper


Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5532)

Abstract

We study a natural probabilistic model for motif discovery that has been used to experimentally test the effectiveness of motif discovery programs. In this model, there are \(k\) background sequences, and each character in a background sequence is a random character from an alphabet \(\Sigma\). A motif \(G = g_1 g_2 \ldots g_m\) is a string of \(m\) characters. A probabilistically generated approximate copy of \(G\) is implanted into each background sequence. In such a copy \(b_1 b_2 \ldots b_m\), every character is probabilistically generated such that the probability of \(b_i \neq g_i\) is at most \(\alpha\). It has been conjectured that multiple background sequences can help with finding faint motifs \(G\).
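The model described above can be sketched as a small instance generator. This is only an illustration under stated assumptions, not the authors' experimental setup: the abstract does not fix the length of a background sequence (called `n` here), the position of the implanted copy (chosen uniformly at random here), or the exact mutation rule (here each character is replaced with probability exactly `alpha` by a uniformly chosen different character, which satisfies the "at most α" condition). The function name `implant_motif` and its parameters are hypothetical.

```python
import random


def implant_motif(motif, alphabet, k, n, alpha, rng):
    """Return k random sequences of length n, each with one noisy copy of motif.

    Assumptions (not taken from the abstract): backgrounds are uniform over
    the alphabet, the copy overwrites a uniformly chosen window, and each
    motif character mutates with probability exactly alpha.
    """
    m = len(motif)
    if m > n:
        raise ValueError("motif must not be longer than the background")
    sequences, positions = [], []
    for _ in range(k):
        # Background: every character is drawn at random from the alphabet.
        background = [rng.choice(alphabet) for _ in range(n)]
        # Approximate copy b_1..b_m of G: P(b_i != g_i) = alpha <= alpha.
        copy = []
        for g in motif:
            if rng.random() < alpha:
                copy.append(rng.choice([c for c in alphabet if c != g]))
            else:
                copy.append(g)
        # Implant the copy at a random position (an assumption of this sketch).
        start = rng.randrange(n - m + 1)
        background[start:start + m] = copy
        sequences.append("".join(background))
        positions.append(start)
    return sequences, positions


if __name__ == "__main__":
    rng = random.Random(2009)
    seqs, starts = implant_motif("ACGTTGCA", "ACGT", k=4, n=40, alpha=0.1, rng=rng)
    for s, p in zip(seqs, starts):
        print(p, s)
```

Passing an explicit `random.Random` instance keeps generated instances reproducible, which is convenient when comparing motif discovery programs on the same inputs.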

In this paper, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet \(\Sigma\) with \(|\Sigma| \ge 2\) and is applicable to DNA motif discovery. We prove that for \(\alpha<{1\over 4}(1-{1\over |\Sigma|})\) and any constant \(x \ge 8\), there exist positive constants \(c_0\), \(\epsilon\), \(\delta_1\) and \(\delta_2\) such that if the length \(\rho\) of the motif \(G\) is at least \(\delta_1 \log n\) and there are \(k \ge c_0 \log n\) input sequences, then in \(O(n^2 + kn)\) time this algorithm finds the motif with probability at least \(1-{1\over 2^x}\) for every \(G\in \Sigma^{\rho}-\Psi_{\rho, h,\epsilon}(\Sigma)\). Here \(h\) is a parameter with \(\rho \ge 4h \ge \delta_2 \log n\), and \(\Psi_{\rho, h,\epsilon}(\Sigma)\) is a small subset containing at most a \(2^{-\Theta(\epsilon^2 h)}\) fraction of the sequences in \(\Sigma^{\rho}\). The constants \(c_0\), \(\epsilon\), \(\delta_1\) and \(\delta_2\) do not depend on \(x\) when \(x\) is a parameter of order \(O(\log n)\). Our algorithm can take any number \(k\) of sequences as input.
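To make the two explicit bounds in the statement concrete, the following sketch evaluates the mutation-rate threshold \({1\over 4}(1-{1\over |\Sigma|})\) and the success probability \(1-{1\over 2^x}\) with exact rational arithmetic. It only does arithmetic on the formulas quoted above and does not implement the algorithm; the function names are hypothetical, and `x` is assumed to be an integer for simplicity.

```python
from fractions import Fraction


def alpha_threshold(alphabet_size):
    """Upper bound on alpha from the abstract: (1/4) * (1 - 1/|Sigma|)."""
    if alphabet_size < 2:
        raise ValueError("the result is stated for |Sigma| >= 2")
    return Fraction(1, 4) * (1 - Fraction(1, alphabet_size))


def success_lower_bound(x):
    """Success probability guaranteed by the abstract: 1 - 1/2^x, for x >= 8."""
    if x < 8:
        raise ValueError("the result is stated for x >= 8")
    return 1 - Fraction(1, 2 ** x)


if __name__ == "__main__":
    print(alpha_threshold(4))      # DNA alphabet {A, C, G, T}: 3/16
    print(alpha_threshold(2))      # binary alphabet: 1/8
    print(success_lower_bound(8))  # smallest allowed x: 255/256
```

For the DNA alphabet the condition therefore reads \(\alpha < 3/16 = 0.1875\), and the threshold approaches \(1/4\) as the alphabet grows.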



Author information

Authors and Affiliations

Dept. of Computer Science, University of Texas – Pan American, TX 78539, USA
Bin Fu
Department of Electrical Engineering and Computer Science, Northwestern University, Evanston, IL 60208, USA
Ming-Yang Kao
Department of Computer Science, The City University of Hong Kong, Kowloon, Hong Kong
Lusheng Wang


Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Texas A&M University, College Station, 77843, Texas, USA
Jianer Chen
School of Mathematics, University of Leeds, LS2 9JT, U.K.
S. Barry Cooper


Cite this paper

Fu, B., Kao, M.-Y., Wang, L. (2009). Discovering Almost Any Hidden Motif from Multiple Sequences in Polynomial Time with Low Sample Complexity and High Success Probability. In: Chen, J., Cooper, S.B. (eds) Theory and Applications of Models of Computation. TAMC 2009. Lecture Notes in Computer Science, vol 5532. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02017-9_26


DOI: https://doi.org/10.1007/978-3-642-02017-9_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02016-2
Online ISBN: 978-3-642-02017-9
eBook Packages: Computer Science, Computer Science (R0)
