
Indexed Matching Statistics and Shortest Unique Substrings

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 8799)

Abstract

The unidirectional and bidirectional matching statistics between two strings s and t on alphabet Σ, and the shortest unique substrings of a single string t, are the cornerstone of a number of large-scale genome analysis applications, and they encode nontrivial structural properties of s and t. In this paper we compute for the first time the matching statistics between s and t in O((|s| + |t|)log|Σ|) time and in O(|s|log|Σ|) bits of space, circumventing the need for computing the depths of suffix tree nodes that characterized previous approaches. Symmetrically, we compute for the first time the shortest unique substrings of a string t in O(|t|log|Σ|) time and in O(|t|log|Σ|) bits of space. A key component of our methods is an encoding of both the unidirectional and the bidirectional statistics that takes 2|t| + o(|t|) bits of space and that allows constant-time access to every position.
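As a rough illustration of the objects named in the abstract, the following sketch computes matching statistics and shortest unique substring lengths by brute force, and then packs the matching statistics into at most 2|t| bits. It is not the paper's algorithm: the quadratic-time loops stand in for the indexed computation, the unary encoding is one well-known way to reach a 2|t|-bit bound and may differ from the authors' construction, and the list of one-positions stands in for a constant-time select structure of o(|t|) bits. The definitions used here are assumptions: MS[i] is taken to be the length of the longest prefix of t[i..] that occurs in s, and the shortest unique substring at i is taken to be the shortest substring starting at i that occurs exactly once in t. All function names are hypothetical.

```python
def matching_statistics(s: str, t: str) -> list[int]:
    """Brute force: MS[i] = length of the longest prefix of t[i:] occurring in s."""
    ms = []
    for i in range(len(t)):
        length = 0
        while i + length < len(t) and t[i:i + length + 1] in s:
            length += 1
        ms.append(length)
    return ms


def encode_ms(ms: list[int]) -> list[int]:
    """Unary encoding that relies on MS[i] >= MS[i-1] - 1.

    For each i, write (MS[i] - MS[i-1] + 1) zeros followed by a one,
    with a virtual MS[-1] = 1. This yields |t| ones and at most |t| zeros.
    """
    bits = []
    prev = 1  # virtual MS[-1], an assumption of this sketch
    for value in ms:
        bits.extend([0] * (value - prev + 1))
        bits.append(1)
        prev = value
    return bits


def decode_ms(bits: list[int], i: int) -> int:
    """Recover MS[i] as select1(i) - 2*i (0-based).

    The list of one-positions is a stand-in for a succinct select structure.
    """
    ones = [p for p, b in enumerate(bits) if b == 1]
    return ones[i] - 2 * i


def count_occurrences(t: str, w: str) -> int:
    """Count possibly overlapping occurrences of w in t."""
    return sum(1 for j in range(len(t) - len(w) + 1) if t.startswith(w, j))


def shortest_unique_lengths(t: str) -> list[int]:
    """Brute force: length of the shortest substring starting at i that occurs
    exactly once in t, or 0 if no such substring exists."""
    sus = []
    for i in range(len(t)):
        found = 0
        for length in range(1, len(t) - i + 1):
            if count_occurrences(t, t[i:i + length]) == 1:
                found = length
                break
        sus.append(found)
    return sus


if __name__ == "__main__":
    ms = matching_statistics("banana", "ananas")
    bits = encode_ms(ms)
    print(ms, bits, [decode_ms(bits, i) for i in range(len(ms))])
    print(shortest_unique_lengths("abab"))
```

The decoding identity follows from counting: the one written for position i is preceded by MS[i] + i zeros and i ones, so it sits at bit position MS[i] + 2i.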




Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Belazzougui, D., Cunial, F. (2014). Indexed Matching Statistics and Shortest Unique Substrings. In: Moura, E., Crochemore, M. (eds) String Processing and Information Retrieval. SPIRE 2014. Lecture Notes in Computer Science, vol 8799. Springer, Cham. https://doi.org/10.1007/978-3-319-11918-2_18


  • DOI: https://doi.org/10.1007/978-3-319-11918-2_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-11917-5

  • Online ISBN: 978-3-319-11918-2

  • eBook Packages: Computer Science, Computer Science (R0)
