Abstract
Full-text searching consists in locating the occurrences of a given pattern P[1..m] in a text T[1..u], both sequences over an alphabet of size σ. In this paper we define a new index for full-text searching on secondary storage, based on the Lempel-Ziv compression algorithm and requiring 8uH k + o(ulogσ) bits of space, where H k denotes the k-th order empirical entropy of T, for any k = o(log σ u). Our experimental results show that our index is significantly smaller than any other practical secondary-memory data structure: 1.4–2.3 times the text size including the text, which means 39%–65% the size of traditional indexes like String B-trees [Ferragina and Grossi, JACM 1999]. In exchange, our index requires more disk access to locate the pattern occurrences. Our index is able to report up to 600 occurrences per disk access, for a disk page of 32 kilobytes. If we only need to count pattern occurrences, the space can be reduced to about 1.04–1.68 times the text size, requiring about 20–60 disk accesses, depending on the pattern length.
Supported in part by CONICYT PhD Fellowship Program (first author) and Fondecyt Grant 1-050493 (second author).
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Apostolico, A.: The myriad virtues of subword trees. In: Combinatorial Algorithms on Words. NATO ISI Series, pp. 85–96. Springer, Heidelberg (1985)
Kurtz, S.: Reducing the space requeriments of suffix trees. Softw. Pract. Exper. 29(13), 1149–1171 (1999)
Manzini, G.: An analysis of the Burrows-Wheeler transform. JACM 48(3), 407–430 (2001)
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys (to appear)
Ferragina, P., Manzini, G.: Indexing compressed texts. JACM 54(4), 552–581 (2005)
Moura, E., Navarro, G., Ziviani, N., Baeza-Yates, R.: Fast and flexible word searching on compressed text. ACM TOIS 18(2), 113–139 (2000)
Ferragina, P., Grossi, R.: The String B-tree: a new data structure for string search in external memory and its applications. JACM 46(2), 236–280 (1999)
Ferragina, P., Grossi, R.: Fast string searching in secondary storage: theoretical developments and experimental results. In: Proc. SODA, pp. 373–382 (1996)
Clark, D., Munro, J.I.: Efficient suffix trees on secondary storage. In: Proc. SODA, pp. 383–391 (1996)
Mäkinen, V., Navarro, G., Sadakane, K.: Advantages of backward searching — efficient secondary memory and distributed implementation of compressed suffix arrays. In: Proc. ISAAC, pp. 681–692 (2004)
Sadakane, K.: Succinct representations of lcp information and improvements in the compressed suffix arrays. In: Proc. SODA, pp. 225–232 (2002)
Navarro, G.: Indexing text using the Ziv-Lempel trie. J. of Discrete Algorithms 2(1), 87–114 (2004)
Ziv, J., Lempel, A.: Compression of individual sequences via variable-rate coding. IEEE TIT 24(5), 530–536 (1978)
Kosaraju, R., Manzini, G.: Compression of low entropy strings with Lempel-Ziv algorithms. SIAM J.Comp. 29(3), 893–911 (1999)
Arroyuelo, D., Navarro, G., Sadakane, K.: Reducing the space requirement of LZ-index. In: Proc. CPM, pp. 319–330 (2006)
Arroyuelo, D., Navarro, G.: Space-efficient construction of LZ-index. In: Proc. ISAAC pp. 1143–1152 (2005)
Munro, I., Raman, V.: Succinct representation of balanced parentheses and static trees. SIAM J.Comp. 31(3), 762–776 (2001)
Munro, I.: Tables. In: Chandru, V., Vinay, V. (eds.) Foundations of Software Technology and Theoretical Computer Science. LNCS, vol. 1180, pp. 37–42. Springer, Heidelberg (1996)
Arroyuelo, D., Navarro, G.: A Lempel-Ziv text index on secondary storage. Technical Report TR/DCC-2004, -4, Dept. of Computer Science, Universidad de Chile (2007), ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/lzidisk.ps.gz
Morrison, D.R.: Patricia – practical algorithm to retrieve information coded in alphanumeric. JACM 15(4), 514–534 (1968)
Harman, D.: Overview of the third text REtrieval conference. In: Proc. Third Text REtrieval Conference (TREC-3), NIST Special Publication, pp. 500–207 (1995)
Baeza-Yates, R., Barbosa, E.F., Ziviani, N.: Hierarchies of indices for text searching. Inf. Systems 21(6), 497–514 (1996)
Manber, U., Myers, G.: Suffix arrays: A new method for on-line string searches. SIAM J. Comp. 22(5), 935–948 (1993)
González, R., Navarro, G.: Compressed text indexes with fast locate. In: Proc. of CPM’07. LNCS (to appear, 2007)
Author information
Authors and Affiliations
Editor information
Rights and permissions
Copyright information
© 2007 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Arroyuelo, D., Navarro, G. (2007). A Lempel-Ziv Text Index on Secondary Storage . In: Ma, B., Zhang, K. (eds) Combinatorial Pattern Matching. CPM 2007. Lecture Notes in Computer Science, vol 4580. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-73437-6_11
Download citation
DOI: https://doi.org/10.1007/978-3-540-73437-6_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-73436-9
Online ISBN: 978-3-540-73437-6
eBook Packages: Computer ScienceComputer Science (R0)