Faster Variance Computation for Patterns with Gaps

  • Conference paper
Design and Analysis of Algorithms (MedAlg 2012)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7659)

Abstract

Determining whether a pattern is statistically overrepresented or underrepresented in a string is a fundamental primitive in computational biology and in large-scale text mining. We study ways to speed up the computation of the expectation and variance of the number of occurrences of a pattern with rigid gaps in a random string. Our contributions are twofold: first, we focus on patterns in which groups of characters from an alphabet Σ can occur at each position. We describe a way to compute the exact expectation and variance of the number of occurrences of a pattern w in a random string generated by a Markov chain in O(|w|^2) time, improving a previous result that required O(2^|w|) time. We then consider the problem of computing expectation and variance of the motifs of a string s in an iid text. Motifs are rigid gapped patterns that occur at least twice in s, and in which at most one character from Σ occurs at each position. We study the case in which s is given offline, and an arbitrary motif w of s is queried online. We relate computational complexity to the structure of w and s, identifying sets of motifs that are amenable to o(|w| log |w|) time online computation after O(|s|^3) preprocessing of s. Our algorithms lend themselves to efficient implementations.
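
To make the quantities in the abstract concrete, the following is a minimal sketch of the textbook quadratic-time computation for the simplest setting: a rigid gapped pattern with at most one character per position, in an iid text of length n. It is not the paper's algorithm, and it covers neither the Markov-chain model nor character groups. The gap symbol `.` and all function names are assumptions made for illustration. The variance is obtained by summing, over every shift d at which two occurrences can overlap, the covariance between an occurrence at position i and one at position i + d.

```python
from typing import Dict, Tuple

WILDCARD = "."  # hypothetical symbol for a gap (don't-care) position


def occurrence_probability(w: str, probs: Dict[str, float]) -> float:
    """Probability that w matches at one fixed position of an iid text."""
    p = 1.0
    for c in w:
        if c != WILDCARD:
            p *= probs[c]
    return p


def overlap_probability(w: str, probs: Dict[str, float], d: int) -> float:
    """Probability that w matches at positions i and i + d, for 1 <= d < len(w)."""
    m = len(w)
    q = 1.0
    for k in range(m + d):
        a = w[k] if k < m else WILDCARD        # constraint from the first copy
        b = w[k - d] if k >= d else WILDCARD   # constraint from the shifted copy
        if a != WILDCARD and b != WILDCARD and a != b:
            return 0.0                         # the two copies conflict
        c = a if a != WILDCARD else b
        if c != WILDCARD:
            q *= probs[c]
    return q


def expectation_and_variance(
    w: str, probs: Dict[str, float], n: int
) -> Tuple[float, float]:
    """Mean and variance of the number of occurrences of w in an iid text of length n."""
    m = len(w)
    positions = n - m + 1                      # number of candidate start positions
    if positions <= 0:
        return 0.0, 0.0
    p = occurrence_probability(w, probs)
    mean = positions * p
    var = positions * p * (1.0 - p)            # sum of the indicator variances
    # Occurrences at distance d >= m share no text position, so in an iid
    # text they are independent and contribute no covariance.
    for d in range(1, min(m, positions)):
        cov = overlap_probability(w, probs, d) - p * p
        var += 2.0 * (positions - d) * cov
    return mean, var


if __name__ == "__main__":
    uniform = {"a": 0.5, "b": 0.5}
    print(expectation_and_variance("a.a", uniform, 5))
```

Each call to `overlap_probability` takes O(|w|) time and there are up to |w| - 1 shifts, so the sketch runs in O(|w|^2) time per pattern. This is the kind of baseline cost that the online bounds quoted in the abstract aim to improve for motifs.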

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Cunial, F. (2012). Faster Variance Computation for Patterns with Gaps. In: Even, G., Rawitz, D. (eds) Design and Analysis of Algorithms. MedAlg 2012. Lecture Notes in Computer Science, vol 7659. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34862-4_10

  • DOI: https://doi.org/10.1007/978-3-642-34862-4_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-34861-7

  • Online ISBN: 978-3-642-34862-4

  • eBook Packages: Computer Science, Computer Science (R0)
