Abstract
The extended Burrows-Wheeler Transform (\(\mathrm{eBWT}\)) was introduced by Mantaci et al. [TCS 2007] to extend the definition of the \(\mathrm{BWT}\) to a collection of strings. In our prior work [SPIRE 2021], we gave a linear-time algorithm for the \(\mathrm{eBWT}\) that preserves the fundamental property of the original definition (i.e., independence from the input order). The algorithm combines a modification of the Suffix Array Induced Sorting (SAIS) algorithm [IEEE Trans Comput 2011] with Prefix-Free Parsing (PFP) [AMB 2019; JCB 2020]. In this paper, we show how this construction algorithm leads to r-indexing the \(\mathrm{eBWT}\), i.e., the run-length encoded \(\mathrm{eBWT}\) and the \(\mathrm{SA}\) samples of Gagie et al. [SODA 2018] can be constructed efficiently from the components of the PFP. Moreover, we show that finding maximal exact matches (MEMs) between a query string and the r-index of the \(\mathrm{eBWT}\) can be efficiently supported.
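To make the objects named in the abstract concrete, the following is a minimal Python sketch of the \(\mathrm{eBWT}\) as defined by Mantaci et al., followed by its run-length encoding. It is a naive illustration only, not the linear-time SAIS/PFP construction of the paper: it sorts all rotations explicitly, so it is suitable only for tiny inputs. The function names `omega_cmp`, `naive_ebwt`, and `run_length_encode` are hypothetical, and the input strings are assumed to be non-empty and primitive (not a proper power of a shorter string).

```python
from functools import cmp_to_key
from itertools import groupby


def omega_cmp(u, v):
    """Compare the infinite words u^omega and v^omega.

    Comparing prefixes of length len(u) + len(v) suffices: if the two
    infinite words agree that far, they are equal (Fine and Wilf).
    """
    n = len(u) + len(v)
    a = (u * (n // len(u) + 1))[:n]
    b = (v * (n // len(v) + 1))[:n]
    return (a > b) - (a < b)


def naive_ebwt(strings):
    """Naive eBWT: sort every rotation of every string in omega-order
    and concatenate the last characters.

    Assumes each string is non-empty and primitive.
    """
    rotations = [s[i:] + s[:i] for s in strings for i in range(len(s))]
    rotations.sort(key=cmp_to_key(omega_cmp))
    return "".join(r[-1] for r in rotations)


def run_length_encode(text):
    """Collapse maximal runs of equal characters into (char, length) pairs."""
    return [(ch, len(list(group))) for ch, group in groupby(text)]


if __name__ == "__main__":
    transform = naive_ebwt(["banana"])
    print(transform)                     # nnbaaa
    print(run_length_encode(transform))  # [('n', 2), ('b', 1), ('a', 3)]
    # The result does not depend on the order of the input collection.
    print(naive_ebwt(["GATA", "TAG"]) == naive_ebwt(["TAG", "GATA"]))  # True
```

Because the rotations of all strings are sorted together, permuting the input collection cannot change the output, which is the order-independence property the abstract refers to. The number of pairs returned by `run_length_encode` corresponds to the number of runs r that gives the r-index its name.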
References
Bannai, H., Gagie, T., Tomohiro, I.: Refining the r-index. Theor. Comput. Sci. 812, 96–108 (2020)
Bannai, H., Kärkkäinen, J., Köppl, D., Piatkowski, M.: Constructing the bijective and the extended Burrows-Wheeler Transform in linear time. In: Proceedings of the 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021. LIPIcs, vol. 191, pp. 7:1–7:16 (2021)
Belazzougui, D., Navarro, G.: Optimal lower and upper bounds for representing sequences. ACM Trans. Algorithms 11(4), 31:1–31:21 (2015)
Boucher, C., Cenzato, D., Lipták, Zs., Rossi, M., Sciortino, M.: Computing the original eBWT faster, simpler, and with less memory. In: Lecroq, T., Touzet, H. (eds.) SPIRE 2021. LNCS, vol. 12944, pp. 129–142. Springer, Cham (2021)
Boucher, C., Gagie, T., Kuhnle, A., Langmead, B., Manzini, G., Mun, T.: Prefix-free parsing for building big BWTs. Algorithms Mol. Biol. 14(1), 13:1–13:15 (2019)
Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical report 124, Digital Equipment Corporation (1994)
Cobas, D., Gagie, T., Navarro, G.: A fast and small subsampled r-index. In: Proceedings of the 32nd Annual Symposium on Combinatorial Pattern Matching, CPM 2021. LIPIcs, vol. 191, pp. 13:1–13:16 (2021)
Gagie, T., Navarro, G., Prezza, N.: Optimal-time text indexing in BWT-runs bounded space. In: Proceedings of the Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2018, pp. 1459–1477 (2018)
Gagie, T., Navarro, G., Prezza, N.: Fully functional suffix trees and optimal text searching in BWT-runs bounded space. J. ACM 67(1), 2:1–2:54 (2020)
Gessel, I.M., Reutenauer, C.: Counting permutations with given cycle structure and descent set. J. Combin. Theory Ser. A 64(2), 189–215 (1993)
Hon, W.-K., Ku, T.-H., Lu, C.-H., Shah, R., Thankachan, S.V.: Efficient algorithm for circular Burrows-Wheeler Transform. In: Kärkkäinen, J., Stoye, J. (eds.) CPM 2012. LNCS, vol. 7354, pp. 257–268. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-31265-6_21
Kärkkäinen, J., Manzini, G., Puglisi, S.J.: Permuted longest-common-prefix array. In: Kucherov, G., Ukkonen, E. (eds.) CPM 2009. LNCS, vol. 5577, pp. 181–192. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-02441-2_17
Kucherov, G., Tóthmérész, L., Vialette, S.: On the combinatorics of suffix arrays. Inf. Process. Lett. 113(22–24), 915–920 (2013)
Kuhnle, A., Mun, T., Boucher, C., Gagie, T., Langmead, B., Manzini, G.: Efficient construction of a complete index for pan-genomics read alignment. J. Comput. Biol. 27(4), 500–513 (2020)
Langmead, B., Trapnell, C., Pop, M., Salzberg, S.: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 10, R25 (2009)
Li, H., Durbin, R.: Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics 25(14), 1754–1760 (2009)
Mäkinen, V., Navarro, G.: Succinct suffix arrays based on run-length encoding. Nord J. Comput. 12, 40–66 (2005)
Mäkinen, V., Välimäki, N., Laaksonen, A., Katainen, A.: Algorithms and Applications. Springer, Heidelberg (2010)
Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol. 17(3), 281–308 (2010)
Manber, U., Myers, G.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
Mantaci, S., Restivo, A., Rosone, G., Sciortino, M.: An extension of the Burrows-Wheeler Transform. Theor. Comput. Sci. 387(3), 298–312 (2007)
Navarro, G.: Compact Data Structures: A Practical Approach. Cambridge University Press, Cambridge (2016)
Nishimoto, T., Tabei, Y.: Optimal-time queries on BWT-runs compressed indexes. In: Proceedings of the 48th International Colloquium on Automata, Languages, and Programming, ICALP 2021. LIPIcs, vol. 198, pp. 101:1–101:15 (2021)
Nong, G., Zhang, S., Chan, W.H.: Two efficient algorithms for linear time suffix array construction. IEEE Trans. Comput. 60(10), 1471–1484 (2011)
Policriti, A., Prezza, N.: LZ77 computation based on the run-length encoded BWT. Algorithmica 80, 1986–2011 (2017)
Rossi, M., Oliva, M., Langmead, B., Gagie, T., Boucher, C.: MONI: a pangenomics index for finding MEMs. In: Proceedings of the 25th Annual International Conference on Research in Computational Molecular Biology, RECOMB 2021 (2021)
Sun, C., et al.: RPAN: rice pan-genome browser for 3000 rice genomes. Nucleic Acids Res. 45(2), 597–605 (2017)
The 1001 Genomes Consortium: Epigenomic diversity in a global collection of Arabidopsis thaliana accessions. Cell 166(2), 492–505 (2016)
Turnbull, C., et al.: The 100,000 genomes project: bringing whole genome sequencing to the NHS. Br. Med. J. 361 (2018)
Acknowledgements
We thank Travis Gagie for comments on preliminary versions of this manuscript. CB and MR are funded by National Science Foundation NSF IIBR (Grant No. 2029552), NSF SCH (Grant No. 2013998), National Institutes of Health (NIH) NIAID (Grant No. HG011392) and NIH NIAID (Grant No. R01AI141810).
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Boucher, C., Cenzato, D., Lipták, Z., Rossi, M., Sciortino, M. (2021). r-Indexing the eBWT. In: Lecroq, T., Touzet, H. (eds) String Processing and Information Retrieval. SPIRE 2021. Lecture Notes in Computer Science, vol 12944. Springer, Cham. https://doi.org/10.1007/978-3-030-86692-1_1
DOI: https://doi.org/10.1007/978-3-030-86692-1_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86691-4
Online ISBN: 978-3-030-86692-1
eBook Packages: Computer Science, Computer Science (R0)