The Peres-Shields Order Estimator for Fixed and Variable Length Markov Models with Applications to DNA Sequence Similarity

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 3692)

Abstract

Recently, Peres and Shields discovered a new method for estimating the order of a stationary fixed-order Markov chain [15]. They showed that the estimator is consistent by proving a threshold result. While this threshold is valid asymptotically, it is not very useful for DNA sequence analysis, where data sizes are moderate. In this paper we give a novel interpretation of the Peres-Shields estimator as a sharp transition phenomenon. This yields a precise and powerful estimator that quickly identifies the core dependencies in data. We show that it compares favorably to other estimators, especially in the presence of noise and/or variable dependencies. Motivated by this last point, we extend the Peres-Shields estimator to Variable Length Markov Chains. We give an application to the problem of detecting DNA sequence similarity using genomic signatures.

Abbreviations: Mk = Fixed order Markov model of order k, PST = Prediction suffix tree, MC = Markov chain, VLMC = Variable length Markov chain.
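
To make the order-estimation problem concrete, the following is a minimal sketch of fitting a fixed order Markov model Mk by counting contexts and choosing k with the BIC criterion (see [6], [19]). It is not the Peres-Shields estimator, whose definition is not given in this abstract; BIC is used here only as a familiar baseline of the kind such estimators are compared against. The function names `log_likelihood` and `bic_order`, the default alphabet size of 4 (DNA), and the choice to condition on the first `max_k` symbols are assumptions made for illustration.

```python
import math
from collections import Counter


def log_likelihood(seq, k, start):
    """Maximised log-likelihood of seq[start:] under an order-k Markov model.

    Transition probabilities are the empirical frequencies N(context, a) / N(context).
    `start` must be >= k so that every position has a full length-k context.
    """
    ctx = Counter()    # N(context): occurrences of each length-k context
    trans = Counter()  # N(context, a): context followed by symbol a
    for i in range(start, len(seq)):
        c = seq[i - k:i]
        ctx[c] += 1
        trans[(c, seq[i])] += 1
    return sum(n * math.log(n / ctx[c]) for (c, _), n in trans.items())


def bic_order(seq, max_k, alphabet_size=4):
    """Return the order in 0..max_k that minimises the BIC score (illustrative baseline)."""
    n = len(seq)

    def bic(k):
        # An order-k chain has |A|^k contexts, each with |A| - 1 free probabilities.
        free_params = alphabet_size ** k * (alphabet_size - 1)
        # Condition on the first max_k symbols so all orders score the same positions.
        return -log_likelihood(seq, k, max_k) + 0.5 * free_params * math.log(n)

    return min(range(max_k + 1), key=bic)
```

On a sequence such as `"ACGT" * 50`, where each symbol is determined by its predecessor, `bic_order(seq, 3)` selects order 1. On real DNA of moderate length the penalty term grows as 4^k, which is one reason fixed thresholds and penalties can behave poorly at practical data sizes.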


References

  1. Akaike, H.: A new look at the statistical model identification. IEEE Trans. Auto. Cont. 19, 716–723 (1974)

  2. Bejerano, G., Yona, G.: Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics 17(1), 23–43 (2001)

  3. Borodovsky, M., McIninch, J.: Recognition of genes in DNA sequence with ambiguities. Biosystems 30, 161–171 (1993)

  4. Bühlmann, P., Wyner, A.: Variable length Markov chains. Ann. Statist. 27(2), 480–513 (1999)

  5. Bühlmann, P., Wyner, A.: Model selection for variable length Markov chains and tuning the context algorithm. Annals of the Inst. of Stat. Math. 52(2), 287–315 (2000)

  6. Csiszár, I., Shields, P.: The Consistency of the BIC Markov Order Estimator. The Annals of Statistics 28(6), 1601–1619 (2000)

  7. Dalevi, D., Dubhashi, D.: Bayesian Classifiers for Detecting HGT Using Fixed and Variable Length Markov Chains (submitted)

  8. Durbin, R., Eddy, S., Krogh, A., Mitchison, G.: Biological Sequence Analysis. Cambridge University Press, Cambridge (2004)

  9. Ellrott, K., Yang, C., Saldek, M., Jiang, T.: Identifying transcription binding sites through Markov chain optimization. Bioinformatics 18(2), 100–109 (2002)

  10. Fan, T.-H., Tsai, C.: A Bayesian Method in Determining the Order of a Finite State Markov Chain. Comm. Statist. Theory and Methods 28(7), 1711–1730 (1999)

  11. Forsdyke, D.: Different Biological Species “Broadcast” Their DNAs at Different (G+C)% “Wavelengths”. J. Theor. Biol. 178, 405–417 (1996)

  12. Karlin, S., Burge, C.: Dinucleotide relative abundance extremes: a genomic signature. Trends Genet 11(7), 283–290 (1995)

  13. Mächler, M., Bühlmann, P.: Variable Length Markov Chains: Methodology, Computing, and Software. J Comp Graph Stat 13(2), 435–455 (2004)

  14. McDiarmid, C.: Concentration. In: Habib, M., McDiarmid, C., Ramirez-Alfonsin, J., Reed, B. (eds.) Probabilistic Methods for Algorithmic Discrete Mathematics Series: Algorithms and Combinatorics, vol. 16, pp. 195–248. Springer, Berlin (1998)

  15. Peres, Y., Shields, P.: Two New Markov Order Estimators, to appear, see, http://www.math.utoledo.edu/~pshields/latex.html

  16. Pride, D., Meinersmann, R., Wassenaar, T., Blaser, M.: Evolutionary implications of microbial genome tetranucleotide frequency biases. Genome Res. 13, 145–158 (2003)

  17. Ron, D., Singer, Y., Tishby, N.: The Power of Amnesia: Learning Probabilistic Automata with Variable Memory Length. Machine Learning 25(2-3), 117–149 (1996)

  18. Sandberg, R., Winberg, G., Branden, C.I., Kaske, A., Ernberg, I., Coster, J.: Capturing whole-genome characteristics in short sequences using a naïve Bayesian classifier. Genome Res. 11(8), 1404–1409 (2001)

  19. Schwarz, G.: Estimating the dimension of a model. Annals of Statistics 6, 461–464 (1978)

  20. Zhao, X., Huang, H., Speed, T.: Finding Short DNA motifs using Permuted Markov models. In: RECOMB, pp. 68–75 (2004)

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Dalevi, D., Dubhashi, D. (2005). The Peres-Shields Order Estimator for Fixed and Variable Length Markov Models with Applications to DNA Sequence Similarity. In: Casadio, R., Myers, G. (eds) Algorithms in Bioinformatics. WABI 2005. Lecture Notes in Computer Science, vol 3692. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11557067_24

  • DOI: https://doi.org/10.1007/11557067_24

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-29008-7

  • Online ISBN: 978-3-540-31812-5

  • eBook Packages: Computer Science, Computer Science (R0)
