
Succinct parallel Lempel–Ziv factorization on a multicore computer

The Journal of Supercomputing

Abstract

This article proposes a succinct parallel algorithm, called pLZone, that computes the Lempel–Ziv (LZ77) factorization of a size-n input string over a constant alphabet in \(\mathcal{O}(n)\) time, using a small workspace of approximately n words, where each word occupies \(\lceil \log n \rceil\) bits. pLZone is designed by dividing the computation of the sequential factorization algorithm LZone into multiple stages, which are organized as a pipeline so that operations run in parallel. A checking method is integrated into the pipeline to verify the output efficiently and guard against implementation bugs. A performance evaluation is conducted by running pLZone and existing representative algorithms on a set of realistic and artificial datasets. In these experiments, pLZone achieves both the best time and the best space results, which suggests that this work could provide a potential solution for efficient LZ77 computation.
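To make the object being computed concrete, the following is a minimal sketch of LZ77 factorization and a simple output check. It is not pLZone or LZone: it is a naive quadratic-time scan written for illustration, and the function names (`lz77_factorize`, `lz77_decode`, `check_factorization`) and the factor encoding are assumptions of this sketch. The check here simply decodes the factors and compares the result with the input, which is a stand-in for, and not a description of, the checking method integrated into the pipeline.

```python
def lz77_factorize(s):
    """Naive LZ77 factorization (illustrative only, quadratic time or worse).

    Each factor is either (char, 0) for a character with no earlier
    occurrence, or (pos, length) for the longest match starting at an
    earlier position pos. Matches may overlap the current position.
    """
    factors = []
    i, n = 0, len(s)
    while i < n:
        best_len, best_pos = 0, -1
        for j in range(i):  # try every earlier starting position
            k = 0
            while i + k < n and s[j + k] == s[i + k]:
                k += 1
            if k > best_len:
                best_len, best_pos = k, j
        if best_len == 0:
            factors.append((s[i], 0))  # literal factor
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors


def lz77_decode(factors):
    """Rebuild the string; copying one character at a time handles overlaps."""
    out = []
    for first, length in factors:
        if length == 0:
            out.append(first)
        else:
            for k in range(length):
                out.append(out[first + k])
    return "".join(out)


def check_factorization(s, factors):
    """Hypothetical check: the factors must decode back to the input."""
    return lz77_decode(factors) == s


factors = lz77_factorize("abababb")
print(factors)
print(check_factorization("abababb", factors))
```

Note that the decode-and-compare check only confirms that the factors reproduce the input; it does not confirm that every factor is the longest possible match.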


Notes

  1. KKP: https://www.cs.helsinki.fi/group/pads/lz77.html, BGone: http://code.google.com/p/bgone, pLZ77: https://github.com/zfy0701/Parallel-LZ77/tree/release.


Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant No. 61872391), the Guangzhou Science and Technology Program (Grant No. 201802010011), and the Foundation for Young Talents in Higher Education of Guangdong, China (Grant No. 2019KQNCX031).

Author information

Corresponding author

Correspondence to Ge Nong.

About this article

Cite this article

Han, L.B., Lao, B. & Nong, G. Succinct parallel Lempel–Ziv factorization on a multicore computer. J Supercomput 78, 7278–7303 (2022). https://doi.org/10.1007/s11227-021-04165-w

  • DOI: https://doi.org/10.1007/s11227-021-04165-w
