Discovering Almost Any Hidden Motif from Multiple Sequences in Polynomial Time with Low Sample Complexity and High Success Probability

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 5532)

Abstract

We study a natural probabilistic model for motif discovery that has been used to experimentally test the effectiveness of motif discovery programs. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet Σ. A motif \(G = g_1 g_2 \ldots g_m\) is a string of m characters. A probabilistically generated approximate copy of G is implanted into each background sequence. In such a copy \(b_1 b_2 \ldots b_m\), every character is generated so that the probability of \(b_i \neq g_i\) is at most α. It has been conjectured that multiple background sequences can help with finding faint motifs G.
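The model above can be sketched as a small input generator. This is only an illustration under stated assumptions, not the authors' test harness: the function name `implant_motif`, the sequence length `n`, the uniform background, the uniformly random implant position, and the choice to mutate each character with probability exactly α (the abstract only requires at most α) are all assumptions made here.

```python
import random

def implant_motif(motif, k, n, alpha, alphabet="ACGT", seed=0):
    """Generate k background sequences of length n, each with one noisy copy of motif.

    Assumptions (not from the paper): uniform background characters, a uniformly
    random implant position, and a mismatch probability of exactly alpha.
    """
    rng = random.Random(seed)
    m = len(motif)
    sequences, positions = [], []
    for _ in range(k):
        # Background: every character drawn at random from the alphabet.
        background = [rng.choice(alphabet) for _ in range(n)]
        # Approximate copy b_1...b_m: each b_i differs from g_i with probability alpha.
        copy = [
            rng.choice([c for c in alphabet if c != g]) if rng.random() < alpha else g
            for g in motif
        ]
        # Implant the copy at a random position inside the background.
        start = rng.randrange(n - m + 1)
        background[start:start + m] = copy
        sequences.append("".join(background))
        positions.append(start)
    return sequences, positions
```

With `alpha=0` every sequence contains an exact copy of the motif at the returned position, which gives a simple sanity check before trying noisier settings.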

In this paper, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet Σ with |Σ| ≥ 2 and is applicable to DNA motif discovery. We prove that for \(\alpha<{1\over 4}(1-{1\over |\Sigma|})\) and any constant x ≥ 8, there exist positive constants \(c_0\), ε, \(\delta_1\) and \(\delta_2\) such that if the length ρ of the motif G is at least \(\delta_1 \log n\) and there are \(k \ge c_0 \log n\) input sequences, then in \(O(n^2 + kn)\) time this algorithm finds the motif with probability at least \(1-{1\over 2^x}\) for every \(G\in \Sigma^{\rho}-\Psi_{\rho, h,\epsilon}(\Sigma)\), where h is a parameter with \(\rho \ge 4h \ge \delta_2 \log n\), and \(\Psi_{\rho, h,\epsilon}(\Sigma)\) is a small subset containing at most a \(2^{-\Theta(\epsilon^2 h)}\) fraction of the sequences in \(\Sigma^{\rho}\). The constants \(c_0\), ε, \(\delta_1\) and \(\delta_2\) do not depend on x when x is a parameter of order \(O(\log n)\). Our algorithm can take any number k of sequences as input.
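To make the two explicit quantities in the theorem concrete, the following sketch evaluates the mutation-rate bound \({1\over 4}(1-{1\over |\Sigma|})\) and the success-probability lower bound \(1-{1\over 2^x}\) for a few values. The function names are hypothetical and chosen only for this illustration; the constants \(c_0\), ε, \(\delta_1\) and \(\delta_2\) are not given in the abstract, so they are not computed here.

```python
from fractions import Fraction

def alpha_threshold(alphabet_size):
    """Strict upper bound on alpha from the abstract: (1/4) * (1 - 1/|Sigma|)."""
    if alphabet_size < 2:
        raise ValueError("the result is stated for |Sigma| >= 2")
    return Fraction(1, 4) * (1 - Fraction(1, alphabet_size))

def success_lower_bound(x):
    """Lower bound on the success probability: 1 - 1/2^x, stated for x >= 8."""
    if x < 8:
        raise ValueError("the result is stated for x >= 8")
    return 1 - Fraction(1, 2 ** x)

if __name__ == "__main__":
    for size, label in [(2, "binary"), (4, "DNA"), (20, "protein")]:
        print(label, alpha_threshold(size))
    print("x = 8:", success_lower_bound(8))
```

For the DNA alphabet (|Σ| = 4) the bound is 3/16, so the theorem covers mutation rates below 18.75% per character, and the smallest admissible x = 8 already gives a success probability of at least 255/256.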




Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Fu, B., Kao, MY., Wang, L. (2009). Discovering Almost Any Hidden Motif from Multiple Sequences in Polynomial Time with Low Sample Complexity and High Success Probability. In: Chen, J., Cooper, S.B. (eds) Theory and Applications of Models of Computation. TAMC 2009. Lecture Notes in Computer Science, vol 5532. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02017-9_26


  • DOI: https://doi.org/10.1007/978-3-642-02017-9_26

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-02016-2

  • Online ISBN: 978-3-642-02017-9

  • eBook Packages: Computer Science, Computer Science (R0)
