research-article

Indexing Highly Repetitive String Collections, Part II: Compressed Indexes

Published: 09 February 2021

Abstract

Two decades ago, a breakthrough in indexing string collections made it possible to represent them within their compressed space while at the same time offering indexed search functionalities. As this new technology permeated through applications like bioinformatics, the string collections experienced a growth that outpaces Moore’s Law and challenges our ability to handle them even in compressed form. Fortunately, it turns out that many of these rapidly growing string collections are highly repetitive, so that their information content is orders of magnitude lower than their plain size. The statistical compression methods used for classical collections, however, are blind to this repetitiveness, and therefore a new set of techniques has been developed to properly exploit it. The resulting indexes form a new generation of data structures able to handle the huge repetitive string collections that we are facing. In this two-part survey, we cover the algorithmic developments that have led to these data structures.

In this second part, we describe the fundamental algorithmic ideas and data structures that form the basis of all existing indexes, and the various concrete structures that have been proposed, comparing them in both theoretical and practical aspects, and uncovering some new combinations. We conclude with the current challenges in this fascinating field.



    • Published in

      ACM Computing Surveys  Volume 54, Issue 2
      March 2022
      800 pages
      ISSN:0360-0300
      EISSN:1557-7341
      DOI:10.1145/3450359

      Copyright © 2021 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 9 February 2021
      • Revised: 1 October 2020
      • Accepted: 1 October 2020
      • Received: 1 April 2020

      Qualifiers

      • research-article
      • Research
      • Refereed
