Lempel–Ziv Factorization Powered by Space Efficient Suffix Trees

Fischer, Johannes; I, Tomohiro; Köppl, Dominik; Sadakane, Kunihiko

doi:10.1007/s00453-017-0333-1

Published: 25 July 2017

Algorithmica, Volume 80, pages 2048–2081 (2018)

Johannes Fischer¹, Tomohiro I², Dominik Köppl¹ (ORCID: orcid.org/0000-0002-8721-4444) & Kunihiko Sadakane³

Abstract

We show that both the Lempel–Ziv-77 and the Lempel–Ziv-78 factorization of a text of length \(n\) on an integer alphabet of size \(\sigma\) can be computed in \(\mathcal{O}(n)\) time with either \(\mathcal{O}(n \lg \sigma)\) bits of working space, or \((1+\epsilon) n \lg n + \mathcal{O}(n)\) bits (for a constant \(\epsilon > 0\)) of working space (including the space for the output, but not the text).
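To make the two factorizations concrete, here is a naive Python sketch. It is not the linear-time, space-efficient method of the article, and it assumes common textbook definitions that this excerpt does not spell out: an LZ77 factor is either the first occurrence of a character or the longest prefix of the remaining text that also starts at an earlier position (overlaps allowed), and an LZ78 factor is the longest previous factor that is a prefix of the remaining text, extended by one character. The function names are hypothetical.

```python
def lz77_factorize(text):
    """Naive LZ77 factorization; far slower than the O(n) bound in the abstract."""
    factors = []
    n = len(text)
    i = 0
    while i < n:
        best_len = 0
        for j in range(i):  # every earlier starting position is a candidate source
            length = 0
            while i + length < n and text[j + length] == text[i + length]:
                length += 1
            best_len = max(best_len, length)
        length = max(best_len, 1)  # no earlier occurrence: single fresh character
        factors.append(text[i:i + length])
        i += length
    return factors


def lz78_factorize(text):
    """Naive LZ78 factorization using a dictionary-based trie of previous factors."""
    trie = {}      # (parent factor index, character) -> factor index; 0 is the root
    factors = []
    node = 0       # current trie node while reading the next factor
    start = 0      # text position where the current factor begins
    for pos, ch in enumerate(text):
        child = trie.get((node, ch))
        if child is not None:
            node = child                     # still inside a previous factor
        else:
            factors.append(text[start:pos + 1])
            trie[(node, ch)] = len(factors)  # new factor = previous factor + ch
            node = 0
            start = pos + 1
    if start < len(text):
        factors.append(text[start:])         # trailing factor without a new character
    return factors


if __name__ == "__main__":
    print(lz77_factorize("abracadabra"))  # ['a', 'b', 'r', 'a', 'c', 'a', 'd', 'abra']
    print(lz78_factorize("abracadabra"))  # ['a', 'b', 'r', 'ac', 'ad', 'ab', 'ra']
```

Both functions return the factors as substrings, so concatenating the result always reproduces the input text.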

Notes

In the initial submission, the \(\mathcal{O}(n)\) deterministic time suffix tree construction algorithm of Munro et al. [43] was not yet published. Our former results were \(\mathcal{O}(n)\) randomized or \(\mathcal{O}(n \lg \lg \sigma)\) deterministic time based on the suffix tree construction algorithm of [2].
More precisely, we use the permuted longest common prefix array that can access \(\mathsf{LCP}\) only in conjunction with \(\mathsf{SA}\).
The time bound for computing the suffix array has recently been improved to \(\mathcal{O}(n)\) by two in-place suffix sorting algorithms [19, 40]. Our succinct suffix tree is composed of both \(\mathsf{SA}\) and \(\mathsf{ISA}\), yielding \((1+\epsilon) n \lg n\) bits and \(\mathcal{O}(n/\epsilon)\) construction time. This construction time is the bottleneck of the succinct suffix tree construction and the later described algorithms. Hence, we can lower the time \(\mathcal{O}(n/\epsilon^2)\) to \(\mathcal{O}(n/\epsilon)\) in Theorem 2.8 and Corollaries 3.2, 3.7, and 4.10.
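The note on the permuted longest common prefix array can be illustrated with a small sketch. It assumes the standard definition in which PLCP stores the LCP values in text order, so that LCP[r] = PLCP[SA[r]], which is why an LCP value is only reachable together with SA. The suffix array is built here by naive sorting, and the PLCP loop follows the well-known scheme based on the predecessor array Φ; neither is the succinct representation used in the article. All function names are hypothetical.

```python
def suffix_array(text):
    """Naive suffix array via sorting of suffixes (illustration only)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def plcp_array(text, sa):
    """PLCP[i] = length of the longest common prefix of suffix i and the
    suffix that precedes it in the suffix array (0 for the smallest suffix)."""
    n = len(text)
    phi = [-1] * n                      # phi[i] = lexicographic predecessor of suffix i
    for r in range(1, n):
        phi[sa[r]] = sa[r - 1]
    plcp = [0] * n
    length = 0
    for i in range(n):                  # text order: PLCP[i + 1] >= PLCP[i] - 1
        j = phi[i]
        if j == -1:
            length = 0
        else:
            while i + length < n and j + length < n and text[i + length] == text[j + length]:
                length += 1
        plcp[i] = length
        length = max(length - 1, 0)
    return plcp


def lcp_at_rank(plcp, sa, r):
    """LCP value at suffix array rank r, obtained through SA."""
    return plcp[sa[r]]


if __name__ == "__main__":
    text = "banana"
    sa = suffix_array(text)
    plcp = plcp_array(text, sa)
    print(sa)    # [5, 3, 1, 0, 4, 2]
    print(plcp)  # [0, 3, 2, 1, 0, 0]
    print([lcp_at_rank(plcp, sa, r) for r in range(len(text))])  # [0, 1, 3, 0, 0, 2]
```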

References

1. Amir, A., Farach, M., Idury, R.M., Poutré, J.A.L., Schäffer, A.A.: Improved dynamic dictionary matching. Inf. Comput. 119(2), 258–282 (1995)
2. Belazzougui, D.: Linear time construction of compressed text indices in compact space. In: Proceedings of the STOC, pp. 148–193. ACM (2014)
3. Belazzougui, D., Puglisi, S.J.: Range predecessor and Lempel–Ziv parsing. In: Proceedings of the SODA, pp. 2053–2071. ACM/SIAM (2016)
4. Belazzougui, D., Mäkinen, V., Valenzuela, D.: Compressed suffix array. In: Encyclopedia of Algorithms, pp. 386–390. Springer (2016)
5. Benoit, D., Demaine, E.D., Munro, J.I., Raman, R., Raman, V., Rao, S.S.: Representing trees of higher degree. Algorithmica 43(4), 275–292 (2005)
6. Clark, D.R.: Compact Pat Trees. Ph.D. Thesis. University of Waterloo (1996)
7. Crochemore, M.: Transducers and repetitions. Theor. Comput. Sci. 45(1), 63–86 (1986)
8. Crochemore, M., Landau, G.M., Ziv-Ukelson, M.: A subquadratic sequence alignment algorithm for unrestricted scoring matrices. SIAM J. Comput. 32(6), 1654–1673 (2003)
9. Duval, J., Kolpakov, R., Kucherov, G., Lecroq, T., Lefebvre, A.: Linear-time computation of local periods. Theor. Comput. Sci. 326(1–3), 229–240 (2004)
10. El-Zein, H., Munro, J.I., Robertson, M.: Raising permutations to powers in place. In: Proceedings of the ISAAC, volume 64 of LIPIcs, pp. 29:1–29:12. Schloss Dagstuhl (2016)
11. Farach, M.: Optimal suffix tree construction with large alphabets. In: Foundations of Computer Science, pp. 137–143. IEEE Computer Society (1997)
12. Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)
13. Fischer, J., Gawrychowski, P.: Alphabet-dependent string searching with wexponential search trees. In: Proceedings of the CPM, volume 9133 of LNCS, pp. 160–171. Springer (2015)
14. Fischer, J., Heun, V.: Space efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput. 40(2), 465–492 (2011)
15. Fischer, J., Heun, V.: Space efficient preprocessing schemes for range minimum queries on static arrays. SIAM J. Comput. 40(2), 465–492 (2011)
16. Franceschini, G., Muthukrishnan, S., Pǎtraşcu, M.: Radix sorting with no extra space. In: Proceedings of the ESA, volume 4698 of LNCS, pp. 194–205. Springer (2007)
17. Gagie, T., Gawrychowski, P., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: A faster grammar-based self-index. In: Proceedings of the LATA, volume 7183 of LNCS, pp. 240–251. Springer (2012)
18. Gagie, T., Gawrychowski, P., Kärkkäinen, J., Nekrich, Y., Puglisi, S.J.: LZ77-based self-indexing with faster pattern matching. In: Proceedings of the LATIN, volume 8392 of LNCS, pp. 731–742. Springer (2014)
19. Goto, K.: Optimal time and space construction of suffix arrays and LCP arrays for integer alphabets. ArXiv CoRR, arXiv:1703.01009 (2017)
20. Goto, K., Bannai, H.: Simpler and faster Lempel Ziv factorization. In: Proceedings of the DCC, pp. 133–142. IEEE Computer Society (2013)
21. Goto, K., Bannai, H.: Space efficient linear time Lempel–Ziv factorization for small alphabets. In: Proceedings of the DCC, pp. 163–172. IEEE Computer Society (2014)
22. Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. SIAM J. Comput. 35(2), 378–407 (2005)
23. Gusfield, D., Stoye, J.: Linear time algorithms for finding and representing all the tandem repeats in a string. J. Comput. Syst. Sci. 69(4), 525–546 (2004)
24. Hon, W.-K., Sadakane, K., Sung, W.-K.: Breaking a time-and-space barrier in constructing full-text indices. In: Proceedings of the FOCS, pp. 251–260. IEEE Computer Society (2003)
25. Jacobson, G.J.: Space-efficient static trees and graphs. In: Proceedings of the FOCS, pp. 549–554. IEEE Computer Society (1989)
26. Jansson, J., Sadakane, K., Sung, W.-K.: Ultra-succinct representation of ordered trees with applications. J. Comput. Syst. Sci. 78(2), 619–631 (2012)
27. Jansson, J., Sadakane, K., Sung, W.-K.: Linked dynamic tries with applications to LZ-compression in sublinear time and space. Algorithmica 71(4), 969–988 (2015)
28. Kärkkäinen, J., Sutinen, E.: Lempel–Ziv index for q-grams. Algorithmica 21(1), 137–154 (1998)
29. Kärkkäinen, J., Ukkonen, E.: Lempel–Ziv parsing and sublinear-size index structures for string matching. In: South American Workshop on String Processing (WSP), pp. 141–155. Carleton University Press (1996)
30. Kärkkäinen, J., Sanders, P., Burkhardt, S.: Linear work suffix array construction. J. ACM 53(6), 1–19 (2006)
31. Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Linear time Lempel–Ziv factorization: simple, fast, small. In: Proceedings of the CPM, volume 7922 of LNCS, pp. 189–200. Springer (2013)
32. Kärkkäinen, J., Kempa, D., Puglisi, S.J.: Lightweight Lempel–Ziv parsing. In: Proceedings of the SEA, volume 7933 of LNCS, pp. 139–150. Springer (2013)
33. Kempa, D., Puglisi, S.J.: Lempel–Ziv factorization: simple, fast, practical. In: Proceedings of the ALENEX, pp. 103–112. SIAM (2013)
34. Kociumaka, T., Kubica, M., Radoszewski, J., Rytter, W., Walen, T.: A linear time algorithm for seeds computation. In: Proceedings of the SODA, pp. 1095–1112. ACM/SIAM (2012)
35. Kolpakov, R.M., Kucherov, G.: Finding maximal repetitions in a word in linear time. In: Proceedings of the FOCS, pp. 596–604 (1999)
36. Kolpakov, R.M., Kucherov, G.: Finding repeats with fixed gap. In: Proceedings of the SPIRE, pp. 162–168. IEEE Computer Society (2000)
37. Köppl, D., Sadakane, K.: Lempel–Ziv computation in compressed space (LZ-CICS). In: Proceedings of the DCC, pp. 3–12. IEEE Computer Society (2016)
38. Li, M., Sleep, R.: An LZ78 based string kernel. In: Proceedings of the ADMA, volume 3584 of LNCS, pp. 678–689. Springer (2005)
39. Li, M., Zhu, Y.: Image classification via LZ78 based string kernel: a comparative study. In: Proceedings of the PAKDD, volume 3918 of LNCS, pp. 704–712. Springer (2006)
40. Li, Z., Li, J., Huo, H.: Optimal in-place suffix sorting. ArXiv CoRR, arXiv:1610.08305 (2016)
41. Main, M.G.: Detecting leftmost maximal periodicities. Discrete Appl. Math. 25(1–2), 145–153 (1989)
42. Manber, U., Myers, E.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
43. Munro, J.I., Navarro, G., Nekrich, Y.: Space-efficient construction of compressed indexes in deterministic linear time. In: Proceedings of the SODA, pp. 408–424. SIAM (2017)
44. Nakashima, Y., Tomohiro, I., Inenaga, S., Bannai, H., Takeda, M.: Constructing LZ78 tries and position heaps in linear time for large alphabets. Inf. Process. Lett. 115(9), 655–659 (2015)
45. Navarro, G.: Indexing text using the Ziv–Lempel trie. J. Discrete Algorithms 2(1), 87–114 (2004)
46. Navarro, G.: Compact Data Structures: A practical approach. Cambridge University Press, Cambridge (2016)
47. Navarro, G., Nekrich, Y.: Optimal dynamic sequence representations. SIAM J. Comput. 43(5), 1781–1806 (2014)
48. Navarro, G., Sadakane, K.: Fully functional static and dynamic succinct trees. ACM Trans. Algorithms 10(3), 16 (2014)
49. Nong, G.: Practical linear-time \(\mathcal{O}(1)\)-workspace suffix sorting for constant alphabets. ACM Trans. Inf. Syst. 31(3), 15 (2013)
50. Ohlebusch, E., Fischer, J., Gog, S.: CST++. In: Proceedings of the SPIRE, volume 6393 of LNCS, pp. 322–333. Springer (2010)
51. Ouyang, J., Luo, H., Wang, Z., Tian, J., Liu, C., Sheng, K.: FPGA implementation of GZIP compression and decompression for IDC services. In: Proceedings of the FPT, pp. 265–268. IEEE Computer Society (2010)
52. Richard, G.G., Case, A.: In lieu of swap: analyzing compressed RAM in Mac OS X and Linux. Digit. Investig. 11, 3–12 (2014)
53. Russo, L.M.S., Navarro, G., Oliveira, A.L.: Fully-compressed suffix trees. In: Proceedings of the LATIN, volume 4957 of LNCS, pp. 362–373. Springer (2008)
54. Sadakane, K.: Succinct representations of LCP information and improvements in the compressed suffix arrays. In: Proceedings of the SODA, pp. 225–237. ACM/SIAM (2002)
55. Sadakane, K.: Compressed suffix trees with full functionality. Theory Comput. Syst. 41(4), 589–607 (2007)
56. Sadakane, K., Grossi, R.: Squeezing succinct data structures into entropy bounds. In: Proceedings of the SODA, pp. 1230–1239. ACM/SIAM (2006)
57. Storer, J.A., Szymanski, T.G.: Data compression via textual substitution. J. ACM 29(4), 928–951 (1982)
58. Välimäki, N., Mäkinen, V., Gerlach, W., Dixit, K.: Engineering a compressed suffix tree implementation. ACM J. Exp. Algorithm. 14, 2 (2009)
59. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23(3), 337–343 (1977)
60. Ziv, J., Lempel, A.: Compression of individual sequences via variable length coding. IEEE Trans. Inf. Theory 24(5), 530–536 (1978)

Acknowledgements

We thank the anonymous reviewers for their careful reading of our manuscript and their insightful comments and suggestions. We are especially grateful to the reviewer who pointed out a simplification of our original solution for storing the exploration counters for the LZ78 factorization (Sect. 4.1). Further, we are grateful to Sean Tohidi, who spell-checked the initial submission of this paper during his DAAD RISE internship at TU Dortmund. This research was supported by CREST, JST.

Author information

Authors and Affiliations

Department of Computer Science, TU Dortmund, Dortmund, Germany
Johannes Fischer & Dominik Köppl
Department of Artificial Intelligence, Kyushu Institute of Technology, Fukuoka, Japan
Tomohiro I
Graduate School of Information Science and Technology, University of Tokyo, Tokyo, Japan
Kunihiko Sadakane

Corresponding author

Correspondence to Dominik Köppl.

Appendix: List of Identifiers

While describing both factorization algorithms, we used several data structures to achieve the small space bounds, among them bit vectors, some with rank- or select-support. We denote bit vectors by \(B_{\alpha}\) for some letter \(\alpha\).
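As a rough illustration of what rank- and select-support on a bit vector provide, the following Python sketch stores block-wise prefix counts. It is a minimal sketch only: the class name, method names, and block size are assumptions made here, and the structure is not succinct, whereas the data structures referred to in the article answer such queries in constant time with sublinear extra space.

```python
class RankSelectBitVector:
    """Bit vector with simple rank/select support (illustration, not succinct)."""

    def __init__(self, bits, block=64):
        self.bits = list(bits)
        self.block = block
        # block_ranks[b] = number of 1-bits before block b
        self.block_ranks = [0]
        for start in range(0, len(self.bits), block):
            ones = sum(self.bits[start:start + block])
            self.block_ranks.append(self.block_ranks[-1] + ones)

    def rank1(self, i):
        """Number of 1-bits among bits[0:i]."""
        b = i // self.block
        return self.block_ranks[b] + sum(self.bits[b * self.block:i])

    def select1(self, k):
        """Position of the k-th 1-bit (k >= 1), found by binary search on rank1."""
        if k < 1 or k > self.rank1(len(self.bits)):
            raise ValueError("k is out of range")
        lo, hi = 0, len(self.bits)
        while lo < hi:
            mid = (lo + hi) // 2
            if self.rank1(mid + 1) >= k:
                hi = mid
            else:
                lo = mid + 1
        return lo


if __name__ == "__main__":
    bv = RankSelectBitVector([1, 0, 1, 1, 0, 0, 1], block=4)
    print(bv.rank1(3))    # 2
    print(bv.select1(3))  # 3
```

A typical use of rank on a marking bit vector is to turn a marked position into a dense index into a companion array: if position p is marked, `rank1(p)` gives the number of marked positions before it, which can serve as an array index.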

Table 2 List of data structures with names

For all types of LZ factorizations we use

\(B_{W}\), marking all witness nodes, and
the array \(W\), mapping witness ids to
- (LZ77) text positions, or
- (LZ78) factor indices.

In LZ77 we use

\(B_{V}\), marking visited nodes.

In LZ78 we use

\(B_{C}\), storing the counter \(n_v\) of each partially explored node \(v\),
\(B_{V}\), marking suffix tree nodes represented in the LZ trie (their ingoing edges are fully explored),
\(B_{LZ}\), marking explicit LZ nodes,
the array \(W'\), mapping LZ nodes to factor indices, and
\(B_{E}\), marking the edge witnesses.

The algorithms based on the succinct suffix tree (SST) additionally use

\(B_{T}\), marking the factor positions, used also for representing the length of a factor.

We count the number of

factors by \(z\),
witnesses by \(z_{\text{W}}\),
referencing factors by \(z_{\text{R}}\), and
fresh factors by \(z_{\text{F}}\).
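The counters for fresh and referencing factors can be illustrated on an LZ77 factorization. The sketch below assumes, since this excerpt does not give the definitions, that a fresh factor is a single character that has not occurred earlier in the text and that every other factor is a referencing factor, so that \(z = z_{\text{F}} + z_{\text{R}}\). Witnesses are defined on suffix tree nodes and are not modeled here. The function name and the dictionary keys are hypothetical.

```python
def count_lz77_factors(factors):
    """Count all, fresh, and referencing factors of an LZ77 factorization.

    `factors` is the list of factor strings in text order. A factor is
    treated as fresh (assumption) if it is one character long and that
    character did not occur in any earlier part of the text.
    """
    seen = set()   # characters that occurred in the text so far
    fresh = 0
    for factor in factors:
        if len(factor) == 1 and factor not in seen:
            fresh += 1
        seen.update(factor)  # add every character of this factor
    z = len(factors)
    return {"z": z, "z_F": fresh, "z_R": z - fresh}


if __name__ == "__main__":
    # One LZ77 factorization of "abracadabra" (self-references allowed).
    factors = ["a", "b", "r", "a", "c", "a", "d", "abra"]
    print(count_lz77_factors(factors))  # {'z': 8, 'z_F': 5, 'z_R': 3}
```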

Figure 14 highlights the kind of suffix tree representation (either compressed or succinct suffix tree) used in each subsection of the algorithmic part of the article. Table 2 lists the particular data structures of each such subsection.

About this article

Cite this article

Fischer, J., I, T., Köppl, D. et al. Lempel–Ziv Factorization Powered by Space Efficient Suffix Trees. Algorithmica 80, 2048–2081 (2018). https://doi.org/10.1007/s00453-017-0333-1

Received: 28 June 2016
Accepted: 07 June 2017
Published: 25 July 2017
Issue Date: July 2018
DOI: https://doi.org/10.1007/s00453-017-0333-1
