Abstract
This article proposes a succinct parallel algorithm, called pLZone, that computes the Lempel–Ziv (LZ77) factorization of a length-n input string over a constant alphabet in \({\mathcal {O}}(n)\) time using a small workspace of approximately n words, where each word occupies \(\lceil \log n\rceil\) bits. pLZone divides the computation of the sequential factorization algorithm LZone into multiple stages organized as a pipeline, so that the stages can run in parallel, and it integrates a checking method into the pipeline to verify the output efficiently and guard against implementation bugs. In a performance evaluation, pLZone and existing representative algorithms were run on a set of realistic and artificial datasets. The proposed algorithm achieved both the best time and the best space results, which suggests that this work could provide a potential solution for efficient LZ77 computation.
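To make concrete what an LZ77 factorization is, and what checking its output can mean, here is a minimal sketch in Python. It is a naive quadratic-time reference, not pLZone or LZone, and it is neither parallel nor space-efficient. The function names `lz77_factorize` and `lz77_decode` and the factor encoding (`(char, 0)` for a fresh character, `(pos, length)` for a copy of an earlier occurrence) are assumptions chosen for illustration only. The check shown here simply decodes the factors and compares against the input, which is not necessarily the checking method used in the article.

```python
def lz77_factorize(s):
    """Naive LZ77 factorization (illustrative only, O(n^2) time or worse).

    Each factor is either (char, 0) for a character with no earlier
    occurrence, or (pos, length) for the longest match starting at an
    earlier position pos. Matches may overlap the current position.
    """
    n = len(s)
    factors = []
    i = 0
    while i < n:
        best_len, best_pos = 0, -1
        # Try every earlier start position j and extend the match greedily.
        for j in range(i):
            k = 0
            while i + k < n and s[j + k] == s[i + k]:
                k += 1
            if k > best_len:
                best_len, best_pos = k, j
        if best_len == 0:
            factors.append((s[i], 0))
            i += 1
        else:
            factors.append((best_pos, best_len))
            i += best_len
    return factors


def lz77_decode(factors):
    """Rebuild the string from its factors; used as a simple output check."""
    out = []
    for a, length in factors:
        if length == 0:
            out.append(a)
        else:
            # Copy one character at a time so overlapping copies work.
            for k in range(length):
                out.append(out[a + k])
    return "".join(out)


if __name__ == "__main__":
    text = "abababa"
    factors = lz77_factorize(text)
    print(factors)
    # Decoding the factors must reproduce the input exactly.
    assert lz77_decode(factors) == text
```

Decoding only confirms that the factors reproduce the input. It does not confirm that each factor is the longest possible match, which a complete check of a factorization would also need to establish.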
Acknowledgements
This work was supported by the National Natural Science Foundation of China (Grant No. 61872391), the Guangzhou Science and Technology Program (Grant No. 201802010011), and the Foundation for Young Talents in Higher Education of Guangdong, China (Grant No. 2019KQNCX031).
About this article
Cite this article
Han, L.B., Lao, B. & Nong, G. Succinct parallel Lempel–Ziv factorization on a multicore computer. J Supercomput 78, 7278–7303 (2022). https://doi.org/10.1007/s11227-021-04165-w
DOI: https://doi.org/10.1007/s11227-021-04165-w