r-Indexing the eBWT

  • Conference paper
  • String Processing and Information Retrieval (SPIRE 2021)

Abstract

The extended Burrows-Wheeler Transform (\(\mathrm {eBWT}\)) was introduced by Mantaci et al. [TCS 2007] to extend the definition of the \(\mathrm {BWT}\) to a collection of strings. In our prior work [SPIRE 2021], we gave a linear-time algorithm for the \(\mathrm {eBWT}\) that preserves the fundamental property of the original definition, namely independence from the input order. The algorithm combines a modification of the Suffix Array Induced Sorting (SAIS) algorithm [IEEE Trans Comput 2011] with Prefix Free Parsing (PFP) [AMB 2019; JCB 2020]. In this paper, we show how this construction algorithm leads to r-indexing the \(\mathrm {eBWT}\): the run-length encoded \(\mathrm {eBWT}\) and the \(\mathrm {SA}\) samples of Gagie et al. [SODA 2018] can be constructed efficiently from the components of the PFP. Moreover, we show that finding maximal exact matches (MEMs) between a query string and the r-index of the \(\mathrm {eBWT}\) can be efficiently supported.
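To make the objects named in the abstract concrete, the following is a minimal sketch of the \(\mathrm {eBWT}\) definition and its run-length encoding. It is a naive illustration that sorts all conjugates explicitly, not the linear-time SAIS/PFP construction described in the paper. The function names `omega_key`, `naive_ebwt` and `run_length_encode` are hypothetical and chosen for this example only. The sketch assumes the standard omega-order, in which two conjugates are compared via their infinite powers; by the Fine and Wilf periodicity lemma, a prefix of length twice the longest input string is enough to decide every comparison.

```python
from itertools import groupby


def omega_key(u: str, length: int) -> str:
    """Return the prefix of the infinite power u^omega of the given length."""
    reps = length // len(u) + 1
    return (u * reps)[:length]


def naive_ebwt(strings: list[str]) -> str:
    """Naive eBWT: sort all conjugates in omega-order, then read last characters.

    Illustration only (hypothetical helper); it materialises every conjugate
    and is far from the linear-time algorithm referenced in the abstract.
    """
    # Two infinite powers u^omega and v^omega that agree on |u| + |v| characters
    # are equal, so 2 * (longest input length) characters always suffice.
    limit = 2 * max(len(s) for s in strings)
    conjugates = [s[i:] + s[:i] for s in strings for i in range(len(s))]
    # Ties occur only between conjugates with equal infinite powers, which
    # share the same last character, so tie order does not affect the output.
    conjugates.sort(key=lambda u: omega_key(u, limit))
    return "".join(u[-1] for u in conjugates)


def run_length_encode(text: str) -> list[tuple[str, int]]:
    """Collapse maximal runs of equal characters into (character, length) pairs."""
    return [(ch, len(list(run))) for ch, run in groupby(text)]


if __name__ == "__main__":
    ebwt = naive_ebwt(["ab", "abb"])
    print(ebwt)                     # bbaba
    print(run_length_encode(ebwt))  # [('b', 2), ('a', 1), ('b', 1), ('a', 1)]
```

Swapping the input order, as in `naive_ebwt(["abb", "ab"])`, yields the same string, which illustrates the order-independence property mentioned above. The number of pairs returned by `run_length_encode` corresponds to the number of runs r that gives the r-index its name.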

Notes

  1. Our conjugate array \(\mathrm {CA}\) is called circular suffix array and denoted \(\mathrm {SA}_{\circ }\) in [2, 11], and BW-array in [13], but in both cases defined for primitive strings only.

References

  1. Bannai, H., Gagie, T., Tomohiro, I.: Refining the r-index. Theor. Comput. Sci. 812, 96–108 (2020)

  2. Bannai, H., Kärkkäinen, J., Köppl, D., Piatkowski, M.: Constructing the bijective and the extended Burrows-Wheeler Transform in linear time. In: Proceedings of the 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021. LIPIcs, vol. 191, pp. 7:1–7:16 (2021)

  3. Belazzougui, D., Navarro, G.: Optimal lower and upper bounds for representing sequences. ACM Trans. Algorithms 11(4), 31:1-31:21 (2015)

  4. Boucher, C., Cenzato, D., Lipták, Zs., Rossi, M., Sciortino, M.: Computing the original eBWT faster, simpler, and with less memory. In: Lecroq, T., Touzet, H. (eds.) SPIRE 2021. LNCS, vol. 12944, pp. 129–142. Springer, Cham (2021)

  5. Boucher, C., Gagie, T., Kuhnle, A., Langmead, B., Manzini, G., Mun, T.: Prefix-free parsing for building big BWTs. Algorithms Mol. Biol. 14(1), 13:1-13:15 (2019)

  6. Burrows, M., Wheeler, D.J.: A block sorting lossless data compression algorithm. Technical report 124, Digital Equipment Corporation (1994)

  7. Cobas, D., Gagie, T., Navarro, G.: A fast and small subsampled r-index. In: Proceedings of the 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021. LIPIcs, vol. 191, pp. 13:1–13:16 (2021)

  8. Gagie, T., Navarro, G., Prezza, N.: Optimal-time text indexing in BWT-runs bounded space. In: Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, pp. 1459–1477 (2018)

  9. Gagie, T., Navarro, G., Prezza, N.: Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM 67(1), 2:1-2:54 (2020)

  10. Gessel, I.M., Reutenauer, C.: Counting permutations with given cycle structure and descent set. J. Combin. Theory Ser. A 64(2), 189–215 (1993)

  11. Hon, W.-K., Ku, T.-H., Lu, C.-H., Shah, R., Thankachan, S.V.: Efficient algorithm for circular Burrows-Wheeler Transform. In: Kärkkäinen, J., Stoye, J. (eds.) CPM 2012. LNCS, vol. 7354, pp. 257–268. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31265-6_21

  12. Kärkkäinen, J., Manzini, G., Puglisi, S.J.: Permuted longest-common-prefix array. In: Kucherov, G., Ukkonen, E. (eds.) CPM 2009. LNCS, vol. 5577, pp. 181–192. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02441-2_17

  13. Kucherov, G., Tóthmérész, L., Vialette, S.: On the combinatorics of suffix arrays. Inf. Process. Lett. 113(22–24), 915–920 (2013)

  14. Kuhnle, A., Mun, T., Boucher, C., Gagie, T., Langmead, B., Manzini, G.: Efficient construction of a complete index for pan-genomics read alignment. J. Comput. Biol. 27(4), 500–513 (2020)

  15. Langmead, B., Trapnell, C., Pop, M., Salzberg, S.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009)

  16. Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics 25(14), 1754–1760 (2009)

  17. Mäkinen, V., Navarro, G.: Succinct suffix arrays based on run-length encoding. Nord J. Comput. 12, 40–66 (2005)

  18. Mäkinen, V., Välimäki, N., Laaksonen, A., Katainen, A.: Algorithms and Applications. Springer, Heidelberg (2010)

  19. Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol. 17(3), 281–308 (2010)

  20. Manber, U., Myers, G.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)

  21. Mantaci, S., Restivo, A., Rosone, G., Sciortino, M.: An extension of the Burrows-Wheeler Transform. Theor. Comput. Sci. 387(3), 298–312 (2007)

  22. Navarro, G.: Compact Data Structures: A Practical Approach. Cambridge University Press, Cambridge (2016)

  23. Nishimoto, T., Tabei, Y.: Optimal-time queries on BWT-runs compressed indexes. In: Proceedings of the 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021. LIPIcs, vol. 198, pp. 101:1–101:15 (2021)

  24. Nong, G., Zhang, S., Chan, W.H.: Two efficient algorithms for linear time suffix array construction. IEEE Trans. Comput. 60(10), 1471–1484 (2011)

  25. Policriti, A., Prezza, N.: LZ77 computation based on the run-length encoded BWT. Algorithmica 80, 1986–2011 (2017)

  26. Rossi, M., Oliva, M., Langmead, B., Gagie, T., Boucher, C.: MONI: a pangenomics index for finding MEMs. In: Proceedings of the 25th Annual International Conference on Research in Computational Molecular Biology, RECOMB 2021 (2021)

  27. Sun, C., et al.: RPAN: rice pan-genome browser for 3000 rice genomes. Nucleic Acids Res. 45(2), 597–605 (2017)

  28. The 1001 Genomes Consortium: Epigenomic diversity in a global collection of Arabidopsis thaliana accessions. Cell 166(2), 492–505 (2016)

  29. Turnbull, C., et al.: The 100,000 genomes project: bringing whole genome sequencing to the NHS. Br. Med. J. 361 (2018)

Acknowledgements

We thank Travis Gagie for comments on preliminary versions of this manuscript. CB and MR are funded by National Science Foundation NSF IIBR (Grant No. 2029552), NSF SCH (Grant No. 2013998), National Institutes of Health (NIH) NIAID (Grant No. HG011392) and NIH NIAID (Grant No. R01AI141810).

Author information

Corresponding author

Correspondence to Christina Boucher.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Boucher, C., Cenzato, D., Lipták, Z., Rossi, M., Sciortino, M. (2021). r-Indexing the eBWT. In: Lecroq, T., Touzet, H. (eds) String Processing and Information Retrieval. SPIRE 2021. Lecture Notes in Computer Science, vol 12944. Springer, Cham. https://doi.org/10.1007/978-3-030-86692-1_1

  • DOI: https://doi.org/10.1007/978-3-030-86692-1_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86691-4

  • Online ISBN: 978-3-030-86692-1

  • eBook Packages: Computer Science, Computer Science (R0)