Abstract
We study a problem related to the extraction of over-represented words from a given source text x of length n. The words are allowed to occur with k mismatches, and x is produced by a source over an alphabet Σ according to a Markov chain of order p. We propose an online algorithm that computes the expected number of occurrences of a word y of length m in $O(mk|\Sigma|^{p+1})$ time. We also propose an offline algorithm that computes the probability of any word occurring in the text in $O(k|\Sigma|^{2})$ time after $O(nk|\Sigma|^{p+1})$ pre-processing. This algorithm allows us to compute the expectation for all the words in a text of length n in $O(kn^{2}|\Sigma|^{2} + nk|\Sigma|^{p+1})$ time, rather than the $O(n^{3}|\Sigma|^{p+1})$ obtainable with other methods. Although this study was motivated by the motif discovery problem in bioinformatics, the results apply to any other domain involving combinatorics on words.
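To make the quantity in the abstract concrete, here is a minimal Python sketch of a straightforward dynamic program. It is not the authors' algorithm, only an illustration under stated assumptions: the chain has order p = 1, "k mismatches" is read as "at most k mismatches", and the initial distribution is stationary, so that every text position has the same occurrence probability. The names `mismatch_probability`, `expected_occurrences`, `init`, and `trans` are hypothetical. For p = 1 the loop structure costs on the order of mk|Σ|^2 operations, which agrees with the online bound quoted above.

```python
def mismatch_probability(y, k, alphabet, init, trans):
    """Probability that a random window of length len(y) matches y
    with at most k mismatches, under a first-order Markov chain.

    init[c]     : probability that the window starts with symbol c (assumed stationary)
    trans[c][d] : probability of symbol d following symbol c
    """
    if not y:
        return 1.0  # the empty word matches everywhere

    # state[c][j] = P(window prefix ends in symbol c with exactly j mismatches so far)
    state = {c: [0.0] * (k + 1) for c in alphabet}
    for c in alphabet:
        j = 0 if c == y[0] else 1
        if j <= k:
            state[c][j] += init[c]

    # Extend the window one symbol at a time, tracking the mismatch count.
    for target in y[1:]:
        nxt = {c: [0.0] * (k + 1) for c in alphabet}
        for c in alphabet:
            for j in range(k + 1):
                p = state[c][j]
                if p == 0.0:
                    continue
                for d in alphabet:
                    j2 = j + (d != target)
                    if j2 <= k:  # drop paths that exceed the mismatch budget
                        nxt[d][j2] += p * trans[c][d]
        state = nxt

    return sum(sum(row) for row in state.values())


def expected_occurrences(y, k, n, alphabet, init, trans):
    """Expected number of (possibly overlapping) occurrences of y, with at
    most k mismatches, in a text of length n. Uses linearity of expectation
    and assumes a stationary chain, so every start position contributes equally."""
    m = len(y)
    if n < m:
        return 0.0
    return (n - m + 1) * mismatch_probability(y, k, alphabet, init, trans)
```

As a sanity check, with a uniform source over {A, C, G, T} the word ACGT has exact-match probability (1/4)^4 = 1/256, and allowing one mismatch adds 4 · (3/4) · (1/4)^3 = 12/256, for a total of 13/256.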
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pizzi, C., Bianco, M. (2009). Expectation of Strings with Mismatches under Markov Chain Distribution. In: Karlgren, J., Tarhio, J., Hyyrö, H. (eds) String Processing and Information Retrieval. SPIRE 2009. Lecture Notes in Computer Science, vol 5721. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03784-9_22
DOI: https://doi.org/10.1007/978-3-642-03784-9_22
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03783-2
Online ISBN: 978-3-642-03784-9
eBook Packages: Computer Science (R0)