Composite Repetition-Aware Data Structures

Belazzougui, Djamal; Cunial, Fabio; Gagie, Travis; Prezza, Nicola; Raffinot, Mathieu

doi:10.1007/978-3-319-19929-0_3

Djamal Belazzougui^16,17,
Fabio Cunial^16,17,
Travis Gagie^16,17,
Nicola Prezza¹⁸ &
…
Mathieu Raffinot¹⁹

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 9133))

Included in the following conference series:

Annual Symposium on Combinatorial Pattern Matching

1034 Accesses

Abstract

In highly repetitive strings, like collections of genomes from the same species, distinct measures of repetition all grow sublinearly in the length of the text, and indexes targeted to such strings typically depend only on one of these measures. We describe two data structures whose size depends on multiple measures of repetition at once, and that provide competitive tradeoffs between the time for counting and reporting all the exact occurrences of a pattern, and the space taken by the structure. The key component of our constructions is the run-length encoded BWT (RLBWT), which takes space proportional to the number of BWT runs: rather than augmenting RLBWT with suffix array samples, we combine it with data structures from LZ77 indexes, which take space proportional to the number of LZ77 factors, and with the compact directed acyclic word graph (CDAWG), which takes space proportional to the number of extensions of maximal repeats. The combination of CDAWG and RLBWT enables also a new representation of the suffix tree, whose size depends again on the number of extensions of maximal repeats, and that is powerful enough to support matching statistics and constant-space traversal.

Travis Gagie—Supported by the Academy of Finland.

This work was partially supported by Academy of Finland under grant 284598 (Center of Excellence in Cancer Genetics Research).

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 39.99; Price excludes VAT (USA)

Softcover Book: USD 54.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Flexible Indexing of Repetitive Collections

Faster Repetition-Aware Compressed Suffix Trees Based on Block Trees

CHICO: A Compressed Hybrid Index for Repetitive Collections

References

Arroyuelo, D., Navarro, G., Sadakane, K.: Stronger Lempel-Ziv based compressed text indexing. Algorithmica 62(1–2), 54–101 (2012)
Article MATH MathSciNet Google Scholar
Belazzougui, D., Cunial, F., Kärkkäinen, J., Mäkinen, V.: Versatile succinct representations of the bidirectional burrows-wheeler transform. In: Bodlaender, H.L., Italiano, G.F. (eds.) ESA 2013. LNCS, vol. 8125, pp. 133–144. Springer, Heidelberg (2013)
Chapter Google Scholar
Blumer, A., Blumer, J., Haussler, D., McConnell, R., Ehrenfeucht, A.: Complete inverted files for efficient text retrieval and analysis. J. ACM 34(3), 578–595 (1987)
Article MathSciNet Google Scholar
Chan, T.M., Larsen, K.G., Pătraşcu, M.: Orthogonal range searching on the RAM, revisited. In: Proceedings of SoCG, pp. 1–10 (2011)
Google Scholar
Crochemore, M., Hancart, C.: Automata for matching patterns. In: Rozenberg, G., Salomaa, A. (eds.) Handbook of Formal Languages, vol. 2, pp. 399–462. Springer, Heidelberg (1997)
Chapter Google Scholar
Crochemore, M., Vérin, R.: Direct construction of compact directed acyclic word graphs. In: Apostolico, A., Hein, J. (eds.) Proceedings of CPM. LNCS, vol. 1264, pp. 116–129. Springer, Heidelberg (1997)
Chapter Google Scholar
Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)
Article MathSciNet Google Scholar
P. Ferragina and G. Navarro. Pizza&Chili repetitive corpus. http://pizzachili.dcc.uchile.cl/repcorpus.html. Accessed on 25 January 2015
Gagie, T., Gawrychowski, P., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: LZ77-based self-indexing with faster pattern matching. In: Pardo, A., Viola, A. (eds.) LATIN 2014. LNCS, vol. 8392, pp. 731–742. Springer, Heidelberg (2014)
Chapter Google Scholar
Kärkkäinen, J., Ukkonen, E.: Lempel-Ziv parsing and sublinear-size index structures for string matching. In: Proceedings of WSP, pp. 141–155 (1996)
Google Scholar
Kreft, S., Navarro, G.: On compressing and indexing repetitive sequences. Theor. Comput. Sci. 483, 115–133 (2013)
Article MATH MathSciNet Google Scholar
Lempel, A., Ziv, J.: On the complexity of finite sequences. IEEE Trans. Info. Theory 22(1), 75–81 (1976)
Article MATH MathSciNet Google Scholar
Mäkinen, V., Navarro, G.: Succinct suffix arrays based on run-length encoding. In: Apostolico, A., Crochemore, M., Park, K. (eds.) CPM 2005. LNCS, vol. 3537, pp. 45–56. Springer, Heidelberg (2005)
Chapter Google Scholar
Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of highly repetitive sequence collections. J. Comput. Biol. 17(3), 281–308 (2010)
Article MathSciNet Google Scholar
Radoszewski, J., Rytter, W.: On the structure of compacted subword graphs of ThueMorse words and their applications. J. Discret. Algorithms 11, 15–24 (2012)
Article MATH MathSciNet Google Scholar
Raffinot, M.: On maximal repeats in strings. Inform. Process. Lett. 80(3), 165–169 (2001)
Article MATH MathSciNet Google Scholar
Rytter, W.: The structure of subword graphs and suffix trees of Fibonacci words. Theoret. Comput. Sci. 363(2), 211–223 (2006)
Article MATH MathSciNet Google Scholar
Sirén, J., Välimäki, N., Mäkinen, V., Navarro, G.: Run-length compressed indexes are superior for highly repetitive sequence collections. In: Amir, A., Turpin, A., Moffat, A. (eds.) SPIRE 2008. LNCS, vol. 5280, pp. 164–175. Springer, Heidelberg (2008)
Chapter Google Scholar
Willard, D.E.: Log-logarithmic worst-case range queries are possible in space $\Theta (N)$. Inform. Process. Lett. 17(2), 81–84 (1983)
Article MATH MathSciNet Google Scholar
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Info. Theory 23(3), 337–343 (1977)
Article MATH MathSciNet Google Scholar

Download references

Author information

Authors and Affiliations

Department of Computer Science, University of Helsinki, Helsinki, Finland
Djamal Belazzougui, Fabio Cunial & Travis Gagie
Helsinki Institute for Information Technology, Helsinki, Finland
Djamal Belazzougui, Fabio Cunial & Travis Gagie
Department of Mathematics and Computer Science, University of Udine, Udine, Italy
Nicola Prezza
LIAFA, Paris Diderot University, Paris 7, France
Mathieu Raffinot

Authors

Djamal Belazzougui
View author publications
You can also search for this author in PubMed Google Scholar
Fabio Cunial
View author publications
You can also search for this author in PubMed Google Scholar
Travis Gagie
View author publications
You can also search for this author in PubMed Google Scholar
Nicola Prezza
View author publications
You can also search for this author in PubMed Google Scholar
Mathieu Raffinot
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Djamal Belazzougui .

Editor information

Editors and Affiliations

Department of Computer Science, University of Verona, Verona, Italy
Ferdinando Cicalese
Department of Computer Science, Bar-Ilan University, Ramat Gan, Israel
Ely Porat
Department of Computer Science, University of Salerno, Fisciano, Italy
Ugo Vaccaro

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Belazzougui, D., Cunial, F., Gagie, T., Prezza, N., Raffinot, M. (2015). Composite Repetition-Aware Data Structures. In: Cicalese, F., Porat, E., Vaccaro, U. (eds) Combinatorial Pattern Matching. CPM 2015. Lecture Notes in Computer Science(), vol 9133. Springer, Cham. https://doi.org/10.1007/978-3-319-19929-0_3

Download citation

DOI: https://doi.org/10.1007/978-3-319-19929-0_3
Published: 16 June 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-19928-3
Online ISBN: 978-3-319-19929-0
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics