Abstract
We study a natural probabilistic model for motif discovery that has been used to test the effectiveness of motif discovery programs experimentally. In this model, there are k background sequences, and each character in a background sequence is a random character from an alphabet Σ. A motif \(G = g_1 g_2 \ldots g_m\) is a string of m characters. A probabilistically generated approximate copy of G is implanted into each background sequence. In such a copy \(b_1 b_2 \ldots b_m\), each character \(b_i\) is generated so that the probability that \(b_i \neq g_i\) is at most α. It has been conjectured that multiple background sequences can help with finding faint motifs G.
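The generative model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' experimental setup: the function names, the uniform background distribution, the uniform choice of implant position, the choice to overwrite a window of the background (rather than insert into it), and the use of a mutation probability exactly equal to α (the model only requires "at most α") are all assumptions made here.

```python
import random


def approximate_copy(motif, alpha, alphabet, rng):
    """Return a noisy copy of `motif`.

    Assumption: each character is replaced, with probability exactly
    `alpha`, by a different character chosen uniformly from `alphabet`.
    """
    copy = []
    for g in motif:
        if rng.random() < alpha:
            copy.append(rng.choice([c for c in alphabet if c != g]))
        else:
            copy.append(g)
    return "".join(copy)


def generate_instance(motif, k, n, alpha, alphabet="ACGT", seed=0):
    """Generate k sequences of length n, each holding one implanted copy.

    Returns a list of (sequence, position) pairs; `position` is the start
    index of the implanted copy and is kept only so results can be checked.
    Assumptions: uniform background characters, uniform implant position,
    and the copy overwrites a window of the background.
    """
    rng = random.Random(seed)
    m = len(motif)
    instance = []
    for _ in range(k):
        background = [rng.choice(alphabet) for _ in range(n)]
        pos = rng.randrange(n - m + 1)
        background[pos:pos + m] = approximate_copy(motif, alpha, alphabet, rng)
        instance.append(("".join(background), pos))
    return instance
```

Calling `generate_instance("ACGTACGT", k=5, n=50, alpha=0.1)` yields five length-50 DNA strings, each containing a copy of the motif in which roughly one character in ten has been changed.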
In this paper, we develop an efficient algorithm that can discover a hidden motif from a set of sequences for any alphabet Σ with |Σ| ≥ 2, and that is applicable to DNA motif discovery. We prove that for \(\alpha<{1\over 4}(1-{1\over |\Sigma|})\) and any constant x ≥ 8, there exist positive constants \(c_0\), ε, \(\delta_1\) and \(\delta_2\) such that if the length ρ of the motif G is at least \(\delta_1 \log n\) and there are \(k \ge c_0 \log n\) input sequences, then in \(O(n^2 + kn)\) time this algorithm finds the motif with probability at least \(1-{1\over 2^x}\) for every \(G\in \Sigma^{\rho}-\Psi_{\rho, h,\epsilon}(\Sigma)\), where h is a parameter with \(\rho \ge 4h \ge \delta_2 \log n\), and \(\Psi_{\rho, h,\epsilon}(\Sigma)\) is a small subset containing at most a \(2^{-\Theta(\epsilon^2 h)}\) fraction of the sequences in \(\Sigma^{\rho}\). The constants \(c_0\), ε, \(\delta_1\) and \(\delta_2\) do not depend on x when x is a parameter of order \(O(\log n)\). Our algorithm can take any number k of sequences as input.
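To make the two explicit bounds in the theorem concrete, the short sketch below evaluates the mutation-rate threshold \({1\over 4}(1-{1\over |\Sigma|})\) and the success-probability lower bound \(1-{1\over 2^x}\). The function names are hypothetical and chosen for illustration; the constants \(c_0\), ε, \(\delta_1\) and \(\delta_2\) are not given in the abstract, so they are not modeled here.

```python
def max_mutation_rate(alphabet_size):
    """Threshold (1/4) * (1 - 1/|Sigma|); the theorem requires alpha
    to be strictly below this value."""
    if alphabet_size < 2:
        raise ValueError("the result is stated for |Sigma| >= 2")
    return 0.25 * (1 - 1 / alphabet_size)


def success_lower_bound(x):
    """Lower bound 1 - 1/2^x on the success probability (stated for x >= 8)."""
    return 1 - 1 / 2 ** x


# DNA alphabet, |Sigma| = 4: alpha must be below 3/16.
print(max_mutation_rate(4))    # 0.1875
# Binary alphabet, |Sigma| = 2: alpha must be below 1/8.
print(max_mutation_rate(2))    # 0.125
# Smallest allowed x: success probability at least 255/256.
print(success_lower_bound(8))  # 0.99609375
```

For DNA this means the result covers per-character mutation probabilities strictly below 18.75%.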
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Fu, B., Kao, MY., Wang, L. (2009). Discovering Almost Any Hidden Motif from Multiple Sequences in Polynomial Time with Low Sample Complexity and High Success Probability. In: Chen, J., Cooper, S.B. (eds) Theory and Applications of Models of Computation. TAMC 2009. Lecture Notes in Computer Science, vol 5532. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02017-9_26
DOI: https://doi.org/10.1007/978-3-642-02017-9_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02016-2
Online ISBN: 978-3-642-02017-9
eBook Packages: Computer Science, Computer Science (R0)