A space efficient solution to the frequent string mining problem for many databases

Kügel, Adrian; Ohlebusch, Enno

doi:10.1007/s10618-008-0110-5

Published: 09 July 2008

Volume 17, pages 24–38, (2008)

Data Mining and Knowledge Discovery

Abstract

The frequent string mining problem is to find all substrings of a collection of string databases which satisfy database specific minimum and maximum frequency constraints. Our contribution improves the existing linear-time algorithm for this problem in such a way that the peak memory consumption is a constant factor of the size of the largest database of strings. We show how the results for each database can be stored implicitly in space proportional to the size of the database, making it possible to traverse the results in lexicographical order. Furthermore, we present a linear-time algorithm which calculates the intersection of the results of different databases. This algorithm is based on an algorithm to merge two suffix arrays, and our modification allows us to also calculate the LCP table of the resulting suffix array during the merging.
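To make the problem statement concrete, the following is a minimal brute-force sketch in Python. It is not the linear-time, suffix-array-based method described in the abstract; it only illustrates what a solution must output. It assumes that the frequency of a pattern in a database is the number of strings in that database containing the pattern as a substring, and that the candidates are the non-empty substrings of the collection. The function names (`substrings`, `frequency`, `frequent_strings`) and the `(minimum, maximum)` constraint pairs are hypothetical and chosen for illustration only.

```python
from typing import List, Sequence, Set, Tuple


def substrings(s: str) -> Set[str]:
    """Return all non-empty substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def frequency(pattern: str, database: Sequence[str]) -> int:
    """Assumed definition: number of strings in the database containing pattern."""
    return sum(1 for s in database if pattern in s)


def frequent_strings(
    databases: Sequence[Sequence[str]],
    constraints: Sequence[Tuple[int, int]],
) -> List[str]:
    """Brute-force reference: keep every candidate whose frequency in each
    database lies within that database's (minimum, maximum) bounds."""
    candidates: Set[str] = set()
    for db in databases:
        for s in db:
            candidates |= substrings(s)

    result = []
    # Sorting gives the lexicographical order mentioned in the abstract.
    for pattern in sorted(candidates):
        if all(
            lo <= frequency(pattern, db) <= hi
            for db, (lo, hi) in zip(databases, constraints)
        ):
            result.append(pattern)
    return result


db1 = ["abab", "abc"]
db2 = ["bcd", "xbc"]

# Patterns contained in both strings of db1 and in no string of db2.
print(frequent_strings([db1, db2], [(2, 2), (0, 0)]))
```

This sketch enumerates a quadratic number of candidates per string and rescans every database for each one, so it is useful only as a correctness reference on small inputs. The article's contribution is precisely to avoid this cost while keeping peak memory proportional to the largest database.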

Author information

Authors and Affiliations

Faculty of Engineering and Computer Sciences, University of Ulm, 89069, Ulm, Germany
Adrian Kügel & Enno Ohlebusch

Corresponding author

Correspondence to Adrian Kügel.

Additional information

Responsible editors: Walter Daelemans, Bart Goethals, and Katharina Morik.

About this article

Cite this article

Kügel, A., Ohlebusch, E. A space efficient solution to the frequent string mining problem for many databases. Data Min Knowl Disc 17, 24–38 (2008). https://doi.org/10.1007/s10618-008-0110-5

Received: 20 June 2008
Accepted: 23 June 2008
Published: 09 July 2008
Issue Date: August 2008
DOI: https://doi.org/10.1007/s10618-008-0110-5
