Abstract
We present a novel algorithm for finding the longest previous factors in a text, whose working space is proportional to the size of the history text. Moreover, our algorithm is online and exact: unlike the previous batch algorithms [4, 5, 6, 7, 14], which need to read the entire input beforehand, our algorithm reports the longest match just after reading each character. The algorithm can be used directly for data compression, pattern analysis, and data mining. It also supports a window buffer, so the working space can be bounded by discarding the history starting from the oldest character. Using the dynamic rank/select dictionary [17], our algorithm requires $n \log\sigma + o(n \log\sigma) + O(n)$ bits of working space and $O(\log^3 n)$ time per character, for $O(n \log^3 n)$ total time, where $n$ is the length of the history and $\sigma$ is the alphabet size. We implemented our algorithm and compared it with recent algorithms [4, 5, 14] in terms of speed and working space. We found that our algorithm works with a smaller working space, less than 1/2 of that of the previous methods on real-world data, with a reasonable decline in speed.
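To make the problem statement concrete, the following is a minimal brute-force sketch of the online setting described above. It is not the authors' algorithm: it uses no rank/select dictionary and takes quadratic time per character in the worst case. It assumes that "longest match" means the longest suffix of the text read so far that also ends at an earlier position in the history (overlaps allowed). The class name `NaiveOnlineLongestMatch`, its `push` method, and the `window` parameter are hypothetical names chosen for illustration.

```python
from collections import deque


class NaiveOnlineLongestMatch:
    """Brute-force illustration only (hypothetical helper, not the paper's method)."""

    def __init__(self, window=None):
        # With window=None the whole history is kept; otherwise the deque
        # silently discards the oldest character once the window is full.
        self.history = deque(maxlen=window)

    def push(self, ch):
        """Read one character and report (length, start) of the longest match.

        The match is the longest suffix of the current history that also
        ends at an earlier position. `start` is relative to the current
        window, and is -1 when there is no match.
        """
        self.history.append(ch)
        text = "".join(self.history)
        n = len(text)
        best_len, best_start = 0, -1
        # Try every earlier (exclusive) end position and extend backwards.
        for end in range(1, n):
            k = 0
            while k < end and text[end - 1 - k] == text[n - 1 - k]:
                k += 1
            if k > best_len:
                best_len, best_start = k, end - k
        return best_len, best_start


def match_lengths(text, window=None):
    """Convenience wrapper: match length reported after each character."""
    matcher = NaiveOnlineLongestMatch(window)
    return [matcher.push(ch)[0] for ch in text]


if __name__ == "__main__":
    print(match_lengths("abab"))           # full history
    print(match_lengths("aba", window=2))  # history limited to 2 characters
```

With the full history, the final `b` of `abab` matches the earlier `ab`. With `window=2`, the first `a` of `aba` has already been discarded when the second `a` arrives, so no match is reported. This mirrors the trade-off in the abstract between bounded working space and match length.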
References
1. Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. Journal of Discrete Algorithms 2(1), 53–86 (2004)
2. Burrows, M., Wheeler, D.: A block sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation (1994)
3. Chan, H., Hon, W.K., Lam, T.W., Sadakane, K.: Compressed indexes for dynamic text collections. ACM Transactions on Algorithms 3(2), 21 (2007)
4. Chen, G., Puglisi, S.J., Smyth, W.F.: LZ factorization in less time and space. Mathematics in Computer Science (MCS) Special Issue on Combinatorial Algorithms (2008)
5. Chen, G., Puglisi, S.J., Smyth, W.F.: Fast and practical algorithms for computing all the runs in a string. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 307–315. Springer, Heidelberg (2007)
6. Crochemore, M., Ilie, L.: LZ factorization in less time and space. Information Processing Letters 106, 75–80 (2008)
7. Crochemore, M., Ilie, L., Smyth, W.F.: A simple algorithm for computing the Lempel–Ziv factorization. In: DCC, pp. 482–488 (2008)
8. Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: Proc. of FOCS (2000)
9. Fischer, J., Heun, V.: Theoretical and practical improvements on the RMQ-problem, with applications to LCA and LCE. In: Lewenstein, M., Valiente, G. (eds.) CPM 2006. LNCS, vol. 4009, pp. 36–48. Springer, Heidelberg (2006)
10. Fischer, J., Heun, V.: A new succinct representation of RMQ-information and improvements in the enhanced suffix array. In: Chen, B., Paterson, M., Zhang, G. (eds.) ESCAPE 2007. LNCS, vol. 4614. Springer, Heidelberg (2007)
11. Franek, F., Simpson, R.J., Smyth, W.F.: The maximum number of runs in a string. In: AWOCA, pp. 26–35 (2003)
12. Gonnet, G.H., Baeza-Yates, R., Snider, T.: New indices for text: PAT trees and PAT arrays. Information Retrieval: Algorithms and Data Structures, 66–82 (1992)
13. Kasai, T., Lee, G., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 181–192. Springer, Heidelberg (2001)
14. Kolpakov, R., Kucherov, G.: Mreps, http://bioinfo.lifl.fr/mreps/
15. Larsson, J.: Extended application of suffix trees to data compression. In: Proc. of DCC, pp. 190–199 (1996)
16. Larsson, J.: Structures of String Matching and Data Compression. PhD thesis, Lund University (1999)
17. Lee, S., Park, K.: Dynamic rank-select structures with applications to run-length encoded texts. In: Ma, B., Zhang, K. (eds.) CPM 2007. LNCS, vol. 4580, pp. 95–106. Springer, Heidelberg (2007)
18. Lippert, R., Mobarry, C., Walenz, B.: A space-efficient construction of the Burrows-Wheeler transform for genomic data. Journal of Computational Biology (2005)
19. Manber, U., Myers, E.W.: Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
20. Moffat, A.: An improved data structure for cumulative probability tables. Software: Practice and Experience 29, 647–659 (1999)
21. Mori, Y.: libdivsufsort, http://code.google.com/p/libdivsufsort/
22. Navarro, G., Mäkinen, V.: Compressed full-text indexes. ACM Computing Surveys 39(1) (2007)
23. Sadakane, K.: Succinct representations of LCP information and improvements in the compressed suffix arrays. In: ACM-SIAM SODA, pp. 225–232 (2002)
24. Sadakane, K.: Compressed suffix trees with full functionality. J. Theory of Computing Systems (2007)
25. Smyth, W.F.: http://www.cas.mcmaster.ca/~bill/strbings/
26. Weiner, P.: Linear pattern matching algorithms. In: Proceedings of the 14th IEEE Symposium on Switching and Automata Theory, pp. 1–11 (1973)
27. Ziv, J., Lempel, A.: A universal algorithm for sequential data compression. IEEE Transactions on Information Theory 23(3), 337–343 (1977)
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Okanohara, D., Sadakane, K. (2008). An Online Algorithm for Finding the Longest Previous Factors. In: Halperin, D., Mehlhorn, K. (eds) Algorithms - ESA 2008. ESA 2008. Lecture Notes in Computer Science, vol 5193. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-87744-8_58
DOI: https://doi.org/10.1007/978-3-540-87744-8_58
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-87743-1
Online ISBN: 978-3-540-87744-8
eBook Packages: Computer Science, Computer Science (R0)