Abstract
Suffix trees and suffix arrays are important data structures for string processing, providing efficient solutions for many applications involving pattern matching. Recent work by Sinha et al. (SIGMOD 2008) addressed the problem of arranging a suffix array on disk so that querying is fast, and showed that the combination of a small trie and a suffix array-like blocked data structure allows queries to be answered many times faster than alternative disk-based suffix trees. A drawback of their LOF-SA structure, and common to all current disk resident suffix tree/array approaches, is that the space requirement of the data structure, though on disk, is large relative to the text – for the LOF-SA, 13n bytes including the underlying n byte text. In this paper we explore techniques for reducing the space required by the LOF-SA. Experiments show these methods cut the data structure to nearly half its original size, without, for large strings that necessitate on-disk structures, any impact on search times.
Keywords
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
This work was supported by the Australian Research Council.
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsPreview
Unable to display preview. Download preview PDF.
References
Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2(1), 53–86 (2004)
Anh, V.N., Moffat, A.: Inverted index compression using word-aligned binary codes. Information Retrieval 8(1), 151–166 (2005)
Apostolico, A.: The myriad virtues of subword trees. In: Apostolico, A., Galil, Z. (eds.) Combinatorial Algorithms on Words. NATO ASI Series F12, pp. 85–96. Springer, Berlin (1985)
Baeza-Yates, R.A., Barbosa, E.F., Ziviani, N.: Hierarchies of indices for text searching. Information Systems 21(6), 497–514 (1996)
Brisaboa, N.R., Fariña, A., Navarro, G., Esteller, M.F. (S,C)-dense coding: An optimized compression code for natural language text databases. In: Nascimento, M.A., de Moura, E.S., Oliveira, A.L. (eds.) SPIRE 2003. LNCS, vol. 2857, pp. 122–136. Springer, Heidelberg (2003)
Cheung, C.-F., Xu Yu, J., Lu, H.: Constructing suffix tree for gigabyte sequences with megabyte memory. IEEE Transactions on Knowledge and Data Engineering 17(1), 90–105 (2005)
Crauser, A., Ferragina, P.: A theoretical and experimental study on the construction of suffix arrays in external memory. Algorithmica 32, 1–35 (2002)
Culpepper, J.S., Moffat, A.: Enhanced byte codes with restricted prefix properties. In: Consens, M.P., Navarro, G. (eds.) SPIRE 2005. LNCS, vol. 3772, pp. 1–12. Springer, Heidelberg (2005)
Dementiev, R., Kärkkäinen, J., Mehnert, J., Sanders, P.: Better external memory suffix array construction. ACM Journal of Experimental Algorithmics 12(3.4), 1–24 (2008)
Farach-Colton, M., Ferragina, P., Muthukrishnan, S.: On the sorting-complexity of suffix tree construction. Journal of the ACM 47(6), 987–1011 (2000)
González, R., Navarro, G.: Compressed text indexes with fast locate. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 216–227. Springer, Heidelberg (2007)
Gusfield, D.: Algorithms on strings, trees, and sequences: Computer science and computational biology. Cambridge University Press, Cambridge (1997)
Kasai, T., Lee, G.H., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 181–192. Springer, Heidelberg (2001)
Larsson, J., Moffat, A.: Off-line dictionary-based compression. Proceedings of the IEEE 88(11), 1722–1732 (2000)
Manber, U., Myers, G.W.: Suffix arrays: a new method for on-line string searches. SIAM Journal of Computing 22(5), 935–948 (1993)
Manzini, G.: Two space saving tricks for linear time LCP array computation. In: Hagerup, T., Katajainen, J. (eds.) SWAT 2004. LNCS, vol. 3111, pp. 372–383. Springer, Heidelberg (2004)
McCreight, E.M.: A space-economical suffix tree construction algroithm. Journal of the ACM 23(2), 262–272 (1976)
Navarro, G., Mäkinen, V.: Compressed full text indexes. ACM Computing Surveys 39(1) (2007)
Phoophakdee, B., Zaki, M.J.: Genome-scale disk-based suffix tree indexing. In: Chan, C.Y., Ooi, B.C., Zhou, A. (eds.) Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pp. 833–844. ACM, New York (2007)
Sinha, R., Puglisi, S.J., Moffat, A., Turpin, A.: Improving suffix array locality for fast pattern matching on disk. In: Lakshmanan, L.V.S., Ng, R.T., Shasha, D. (eds.) Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, pp. 661–672. ACM, New York (2008)
Smyth, B.: Computing Patterns in Strings. Pearson Addison-Wesley, Essex (2003)
Tian, Y., Tata, S., Hankins, R.A., Patel, J.M.: Practical methods for constructing suffix trees. The VLDB Journal 14(3), 281–299 (2005)
Ukkonen, E.: Online construction of suffix trees. Algorithmica 14(3), 249–260 (1995)
Weiner, P.: Linear pattern matching algorithms. In: Proceedings of the 14th Annual Symposium on Switching and Automata Theory, pp. 1–11. IEEE Computer Society, Washington (1973)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Moffat, A., Puglisi, S.J., Sinha, R. (2009). Reducing Space Requirements for Disk Resident Suffix Arrays . In: Zhou, X., Yokota, H., Deng, K., Liu, Q. (eds) Database Systems for Advanced Applications. DASFAA 2009. Lecture Notes in Computer Science, vol 5463. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-00887-0_63
Download citation
DOI: https://doi.org/10.1007/978-3-642-00887-0_63
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-00886-3
Online ISBN: 978-3-642-00887-0
eBook Packages: Computer ScienceComputer Science (R0)