Abstract
The frequent string mining problem is to find all substrings of a collection of string databases that satisfy database-specific minimum and maximum frequency constraints. Our contribution improves the existing linear-time algorithm for this problem so that its peak memory consumption is within a constant factor of the size of the largest string database. We show how the results for each database can be stored implicitly in space proportional to the size of that database, in a way that allows the results to be traversed in lexicographical order. Furthermore, we present a linear-time algorithm that calculates the intersection of the results of different databases. This algorithm is based on an algorithm for merging two suffix arrays, and our modification also computes the LCP table of the resulting suffix array during the merge.
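To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the linear-time, space-efficient algorithm of the paper; it only illustrates what the output should be. It assumes that the frequency of a substring in a database is the number of strings in that database containing it, and the function names `substrings`, `frequencies`, and `frequent_strings` are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def substrings(s: str) -> Set[str]:
    """Return the set of all non-empty substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def frequencies(database: List[str]) -> Dict[str, int]:
    """Map each substring to the number of strings in the database containing it.

    Assumption: frequency is counted per string, not per occurrence.
    """
    counts: Dict[str, int] = {}
    for s in database:
        for sub in substrings(s):  # a set, so each string counts at most once
            counts[sub] = counts.get(sub, 0) + 1
    return counts


def frequent_strings(
    databases: List[List[str]],
    constraints: List[Tuple[int, int]],
) -> List[str]:
    """Return, in lexicographical order, every substring whose frequency in
    database i lies within constraints[i] = (minimum, maximum) for all i.

    Brute force: quadratic in the string lengths, for illustration only.
    """
    per_db = [frequencies(db) for db in databases]
    candidates: Set[str] = set()
    for freq in per_db:
        candidates.update(freq)
    result = [
        pattern
        for pattern in candidates
        if all(
            lo <= freq.get(pattern, 0) <= hi
            for freq, (lo, hi) in zip(per_db, constraints)
        )
    ]
    return sorted(result)


if __name__ == "__main__":
    dbs = [["abab", "abc"], ["bcd", "xbc"]]
    # Substrings in both strings of the first database and in none of the second.
    print(frequent_strings(dbs, [(2, 2), (0, 0)]))
```

Filtering all databases against one candidate set corresponds to the intersection of the per-database results mentioned above; here it is done with hash maps rather than by merging suffix arrays.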
Additional information
Responsible editors: Walter Daelemans, Bart Goethals, and Katharina Morik.
About this article
Cite this article
Kügel, A., Ohlebusch, E. A space efficient solution to the frequent string mining problem for many databases. Data Min Knowl Disc 17, 24–38 (2008). https://doi.org/10.1007/s10618-008-0110-5
DOI: https://doi.org/10.1007/s10618-008-0110-5