
A space efficient solution to the frequent string mining problem for many databases

Data Mining and Knowledge Discovery

Abstract

The frequent string mining problem is to find all substrings of a collection of string databases which satisfy database-specific minimum and maximum frequency constraints. Our contribution improves the existing linear-time algorithm for this problem in such a way that the peak memory consumption is within a constant factor of the size of the largest database of strings. We show how the results for each database can be stored implicitly in space proportional to the size of the database, making it possible to traverse the results in lexicographical order. Furthermore, we present a linear-time algorithm which calculates the intersection of the results of different databases. This algorithm is based on an algorithm to merge two suffix arrays, and our modification allows us to also calculate the LCP table of the resulting suffix array during the merging.
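
To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the linear-time, space-efficient algorithm of the article and uses no suffix arrays or LCP tables. It assumes that the frequency of a pattern in a database is the number of strings in that database that contain the pattern; the function names `substrings`, `frequency`, and `frequent_strings` and the example data are hypothetical and chosen only for illustration.

```python
from typing import List, Set, Tuple


def substrings(s: str) -> Set[str]:
    """Return all non-empty substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def frequency(pattern: str, database: List[str]) -> int:
    """Assumed definition: number of strings in the database containing pattern."""
    return sum(1 for s in database if pattern in s)


def frequent_strings(
    databases: List[List[str]],
    constraints: List[Tuple[int, int]],
) -> List[str]:
    """Naive solver: test every substring against every (min, max) constraint.

    constraints[i] is the (minimum, maximum) frequency for databases[i].
    The result is returned in lexicographical order.
    """
    # Every candidate must occur somewhere, so collect all substrings.
    candidates: Set[str] = set()
    for db in databases:
        for s in db:
            candidates |= substrings(s)

    result = []
    for p in sorted(candidates):
        if all(
            lo <= frequency(p, db) <= hi
            for db, (lo, hi) in zip(databases, constraints)
        ):
            result.append(p)
    return result


if __name__ == "__main__":
    d1 = ["abab", "abc"]
    d2 = ["bcd", "xab"]
    # Patterns in both strings of d1 and in at most one string of d2.
    print(frequent_strings([d1, d2], [(2, 2), (0, 1)]))
```

This sketch enumerates a quadratic number of substrings per string and rescans every database for each one, which is exactly the cost that a linear-time approach avoids. It can serve as a reference for checking a faster implementation on small inputs.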

Author information

Corresponding author

Correspondence to Adrian Kügel.

Additional information

Responsible editors: Walter Daelemans, Bart Goethals, and Katharina Morik.

About this article

Cite this article

Kügel, A., Ohlebusch, E. A space efficient solution to the frequent string mining problem for many databases. Data Min Knowl Disc 17, 24–38 (2008). https://doi.org/10.1007/s10618-008-0110-5

  • DOI: https://doi.org/10.1007/s10618-008-0110-5
