Skip to main content

Fast q-gram Mining on SLP Compressed Strings

  • Conference paper
String Processing and Information Retrieval (SPIRE 2011)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 7024))

Included in the following conference series:

  • 769 Accesses

Abstract

We present simple and efficient algorithms for calculating q-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP of size n that represents string T, we present an O(qn) time and space algorithm that computes the occurrence frequencies of all q-grams in T. Computational experiments show that our algorithm and its variation are practical for small q, actually running faster on various real string data, compared to algorithms that work on the uncompressed text. We also discuss applications in data mining and classification of string data, for which our algorithms can be useful.

This work was supported by KAKENHI 22680014 (HB).

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Subscribe and save

Springer+ Basic
$34.99 /Month
  • Get 10 units per month
  • Download Article/Chapter or eBook
  • 1 Unit = 1 Article or 1 Chapter
  • Cancel anytime
Subscribe now

Buy Now

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

Similar content being viewed by others

References

  1. Amir, A., Benson, G.: Efficient two-dimensional compressed matching. In: Proc. Data Compression Conference (DCC 1992), pp. 279ā€“288 (1992)

    Google Scholar 

  2. Arimura, H., Wataki, A., Fujino, R., Arikawa, S.: A fast algorithm for discovering optimal string patterns in large text databases. In: Richter, M.M., Smith, C.H., Wiehagen, R., Zeugmann, T. (eds.) ALT 1998. LNCS (LNAI), vol. 1501, pp. 247ā€“261. Springer, Heidelberg (1998)

    Chapter  Google Scholar 

  3. Brazma, A., Jonassen, I., Eidhammer, I., Gilbert, D.: Approaches to the automatic discovery of patterns in biosequences. J. Comp. Biol. 5(2), 279ā€“305 (1998)

    Article  Google Scholar 

  4. Chan, S., Kao, B., Yip, C.L., Tang, M.: Mining emerging substrings. In: Proc. 8th International Conference on Database Systems for Advanced Applications (DASFAA 2003), p. 119 (2003)

    Google Scholar 

  5. Charikar, M., Lehman, E., Liu, D., Panigrahy, R., Prabhakaran, M., Sahai, A., abhi shelat: The smallest grammar problem. IEEE Transactions on Information Theory 51(7), 2554ā€“2576 (2005)

    Article  MathSciNet  MATH  Google Scholar 

  6. Claude, F., Navarro, G.: Self-indexed grammar-based compression. In: Fundamenta Informaticae (to appear), preliminary version: Proc. MFCS 2009, pp. 235ā€“246 (2009)

    Google Scholar 

  7. Gusfield, D.: Algorithms on Strings, Trees, and Sequences. Cambridge University Press, Cambridge (1997)

    Book  MATH  Google Scholar 

  8. Hermelin, D., Landau, G.M., Landau, S., Weimann, O.: A unified algorithm for accelerating edit-distance computation via text-compression. In: Proc. STACS 2009, pp. 529ā€“540 (2009)

    Google Scholar 

  9. Hui, L.C.K.: Color set size problem with application to string matching. In: Apostolico, A., Galil, Z., Manber, U., Crochemore, M. (eds.) CPM 1992. LNCS, vol. 644, pp. 230ā€“243. Springer, Heidelberg (1992)

    Chapter  Google Scholar 

  10. Inenaga, S., Bannai, H.: Finding characteristic substring from compressed texts. In: Proc. The Prague Stringology Conference 2009, pp. 40ā€“54 (2009)

    Google Scholar 

  11. KƤrkkƤinen, J., Sanders, P.: Simple linear work suffix array construction. In: Baeten, J.C.M., Lenstra, J.K., Parrow, J., Woeginger, G.J. (eds.) ICALP 2003. LNCS, vol. 2719, pp. 943ā€“955. Springer, Heidelberg (2003)

    Chapter  Google Scholar 

  12. Karpinski, M., Rytter, W., Shinohara, A.: An efficient pattern-matching algorithm for strings with short descriptions. Nordic Journal of Computing 4, 172ā€“186 (1997)

    MathSciNet  MATH  Google Scholar 

  13. Kasai, T., Lee, G.H., Arimura, H., Arikawa, S., Park, K.: Linear-Time Longest-Common-Prefix Computation in Suffix Arrays and Its Applications. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 181ā€“192. Springer, Heidelberg (2001)

    Chapter  Google Scholar 

  14. Kida, T., Shibata, Y., Takeda, M., Shinohara, A., Arikawa, S.: Collage system: A unifying framework for compressed pattern matching. Theoret. Comput. Sci. 298, 253ā€“272 (2003)

    Article  MathSciNet  MATH  Google Scholar 

  15. Larsson, N.J., Moffat, A.: Offline dictionary-based compression. In: Proc. Data Compression Conference (DCC 1999), pp. 296ā€“305 (1999)

    Google Scholar 

  16. Leslie, C., Eskin, E., Noble, W.S.: The spectrum kernel: A string kernel for SVM protein classification. In: Pacific Symposium on Biocomputing, vol. 7, pp. 566ā€“575 (2002)

    Google Scholar 

  17. Lifshits, Y.: Processing compressed texts: A tractability border. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 228ā€“240. Springer, Heidelberg (2007)

    Chapter  Google Scholar 

  18. Manber, U., Myers, G.: Suffix arrays: A new method for on-line string searches. SIAM J. Computing 22(5), 935ā€“948 (1993)

    Article  MathSciNet  MATH  Google Scholar 

  19. Matsubara, W., Inenaga, S., Ishino, A., Shinohara, A., Nakamura, T., Hashimoto, K.: Efficient algorithms to compute compressed longest common substrings and compressed palindromes. Theoret. Comput. Sci. 410(8ā€“10), 900ā€“913 (2009)

    Article  MathSciNet  MATH  Google Scholar 

  20. Navarro, G., MƤkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1), 2 (2007)

    Article  MATH  Google Scholar 

  21. Nevill-Manning, C.G., Witten, I.H., Maulsby, D.L.: Compression by induction of hierarchical grammars. In: Proc. Data Compression Conference (DCC 1994), pp. 244ā€“253 (1994)

    Google Scholar 

  22. Rytter, W.: Application of Lempel-Ziv factorization to the approximation of grammar-based compression. Theoret. Comput. Sci. 302(1ā€“3), 211ā€“222 (2003)

    Article  MathSciNet  MATH  Google Scholar 

  23. Shibata, Y., Kida, T., Fukamachi, S., Takeda, M., Shinohara, A., Shinohara, T., Arikawa, S.: Speeding up pattern matching by text compression. In: Bongiovanni, G., Petreschi, R., Gambosi, G. (eds.) CIAC 2000. LNCS, vol. 1767, pp. 306ā€“315. Springer, Heidelberg (2000)

    Chapter  Google Scholar 

  24. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Transactions on Information Theory IT-23(3), 337ā€“349 (1977)

    Article  MathSciNet  MATH  Google Scholar 

  25. Ziv, J., Lempel, A.: Compression of individual sequences via variable-length coding. IEEE Transactions on Information Theory 24(5), 530ā€“536 (1978)

    Article  MathSciNet  MATH  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

Ā© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Goto, K., Bannai, H., Inenaga, S., Takeda, M. (2011). Fast q-gram Mining on SLP Compressed Strings. In: Grossi, R., Sebastiani, F., Silvestri, F. (eds) String Processing and Information Retrieval. SPIRE 2011. Lecture Notes in Computer Science, vol 7024. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-24583-1_27

Download citation

  • DOI: https://doi.org/10.1007/978-3-642-24583-1_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-24582-4

  • Online ISBN: 978-3-642-24583-1

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics