Abstract
Length-q substrings, or q-grams, can represent important characteristics of text data, and determining the frequencies of all q-grams contained in the data is an important problem with many applications in the field of data mining and machine learning. In this paper, we consider the problem of calculating the non-overlapping frequencies of all q-grams in a text given in compressed form, namely, as a straight line program (SLP). We show that the problem can be solved in O(q 2 n) time and O(qn) space where n is the size of the SLP. This generalizes and greatly improves previous work (Inenaga & Bannai, 2009) which solved the problem only for q = 2 in O(n 4logn) time and O(n 3) space.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Amir, A., Benson, G.: Efficient two-dimensional compressed matching. In: Proc. DCC 1992, pp. 279–288 (1992)
Apostolico, A., Preparata, F.P.: Data structures and algorithms for the string statistics problem. Algorithmica 15(5), 481–494 (1996)
Bille, P., Landau, G.M., Raman, R., Sadakane, K., Satti, S.R., Weimann, O.: Random access to grammar-compressed strings. In: Proc. SODA 2011, pp. 373–389 (2011)
Brodal, G.S., Lyngsø, R.B., Östlin, A., Pedersen, C.N.S.: Solving the String Statistics Problem in Time O(n logn). In: Widmayer, P., Triguero, F., Morales, R., Hennessy, M., Eidenbenz, S., Conejo, R. (eds.) ICALP 2002. LNCS, vol. 2380, pp. 728–739. Springer, Heidelberg (2002)
Goto, K., Bannai, H., Inenaga, S., Takeda, M.: Towards efficient mining and classification on compressed strings. In: Accepted for SPIRE 2011 (2011), preprint available at arXiv:1103.3114v2
Hermelin, D., Landau, G.M., Landau, S., Weimann, O.: A unified algorithm for accelerating edit-distance computation via text-compression. In: Proc. STACS 2009, pp. 529–540 (2009)
Inenaga, S., Bannai, H.: Finding characteristic substring from compressed texts. In: Proc. The Prague Stringology Conference 2009, pp. 40–54 (2009); full version to appear in the International Journal of Foundations of Computer Science
Karpinski, M., Rytter, W., Shinohara, A.: An efficient pattern-matching algorithm for strings with short descriptions. Nordic Journal of Computing 4, 172–186 (1997)
Knuth, D.E., Morris, J.H., Pratt, V.R.: Fast pattern matching in strings. SIAM Journal on Computing 6(2), 323–350 (1977)
Larsson, N.J., Moffat, A.: Off-line dictionary-based compression. Proceedings of the IEEE 88(11), 1722–1732 (2000)
Lifshits, Y.: Processing Compressed Texts: A Tractability Border. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 228–240. Springer, Heidelberg (2007)
Matsubara, W., Inenaga, S., Ishino, A., Shinohara, A., Nakamura, T., Hashimoto, K.: Efficient algorithms to compute compressed longest common substrings and compressed palindromes. Theoretical Computer Science 410(8-10), 900–913 (2009)
Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1), 2 (2007)
Nevill-Manning, C.G., Witten, I.H., Maulsby, D.L.: Compression by induction of hierarchical grammars. In: Proc. DCC 1994, pp. 244–253 (1994)
Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Transactions on Information Theory IT-23(3), 337–349 (1977)
Ziv, J., Lempel, A.: Compression of individual sequences via variable-length coding. IEEE Transactions on Information Theory 24(5), 530–536 (1978)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Goto, K., Bannai, H., Inenaga, S., Takeda, M. (2012). Computing q-Gram Non-overlapping Frequencies on SLP Compressed Texts. In: Bieliková, M., Friedrich, G., Gottlob, G., Katzenbeisser, S., Turán, G. (eds) SOFSEM 2012: Theory and Practice of Computer Science. SOFSEM 2012. Lecture Notes in Computer Science, vol 7147. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-27660-6_25
Download citation
DOI: https://doi.org/10.1007/978-3-642-27660-6_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-27659-0
Online ISBN: 978-3-642-27660-6
eBook Packages: Computer ScienceComputer Science (R0)