ABSTRACT
Frequent string mining is widely used in text processing to extract text features. Most research has focused on text made up of single-byte characters. Consequently, the resulting applications run into problems when applied to text represented with multibyte characters, such as Japanese and Chinese text. The main drawback is the huge memory usage required to handle multibyte character strings. To solve this problem, we use wavelet tree-based compressed suffix arrays instead of ordinary suffix arrays to reduce memory usage, together with a novel technique that uses the rank operation to improve runtime efficiency. Our experimental evaluation shows that the proposed method reduces processing time by 45% compared with a method using only compressed suffix arrays. The proposed method also reduces memory usage by 75%.
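The abstract names a wavelet tree and its rank operation but gives no details. As a rough illustration only, and not the authors' implementation, the Python sketch below builds a wavelet tree over the Burrows-Wheeler transform of a Japanese string and uses rank in the standard FM-index backward search to count substring occurrences. The names `WaveletTree`, `bwt`, `build_c`, and `count` are hypothetical. The sketch also assumes uncompressed prefix-count lists in place of succinct bitmaps, naive suffix sorting, and a `"\0"` sentinel that does not occur in the text. Each Unicode code point is treated as one symbol, so the alphabet is the set of distinct characters rather than bytes.

```python
class WaveletTree:
    """Illustrative wavelet tree; bitmaps are kept as plain prefix-count lists."""

    def __init__(self, seq, alphabet=None):
        self.alphabet = sorted(set(seq)) if alphabet is None else alphabet
        self.left = self.right = None
        self.prefix = None  # stays None at a leaf
        if len(self.alphabet) <= 1:
            return
        mid = len(self.alphabet) // 2
        left_syms = set(self.alphabet[:mid])
        # prefix[i] = how many of the first i symbols go to the right child
        self.prefix = [0]
        for ch in seq:
            self.prefix.append(self.prefix[-1] + (ch not in left_syms))
        self.left = WaveletTree([c for c in seq if c in left_syms],
                                self.alphabet[:mid])
        self.right = WaveletTree([c for c in seq if c not in left_syms],
                                 self.alphabet[mid:])

    def rank(self, ch, i):
        """Number of occurrences of ch among the first i symbols."""
        node = self
        while node.prefix is not None:
            mid = len(node.alphabet) // 2
            if ch < node.alphabet[mid]:
                i = i - node.prefix[i]   # symbols routed left
                node = node.left
            else:
                i = node.prefix[i]       # symbols routed right
                node = node.right
        return i if node.alphabet and node.alphabet[0] == ch else 0


def bwt(text):
    """Burrows-Wheeler transform via naive suffix sorting (assumption: fine for a demo)."""
    text += "\0"  # sentinel assumed absent from text and smaller than every symbol
    sa = sorted(range(len(text)), key=lambda k: text[k:])
    return "".join(text[k - 1] for k in sa)


def build_c(last):
    """c[ch] = number of symbols in the text that are smaller than ch."""
    c, total = {}, 0
    for ch in sorted(set(last)):
        c[ch] = total
        total += last.count(ch)
    return c


def count(pattern, wt, c_table, n):
    """Occurrences of pattern (overlaps included), via backward search using rank."""
    lo, hi = 0, n
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + wt.rank(ch, lo)
        hi = c_table[ch] + wt.rank(ch, hi)
        if lo >= hi:
            return 0
    return hi - lo


if __name__ == "__main__":
    last = bwt("すもももももももものうち")
    wt = WaveletTree(last)
    print(count("もも", wt, build_c(last), len(last)))
```

A frequent-substring miner could call `count` on candidate strings and keep those whose count meets a support threshold. A production implementation would replace the prefix-count lists with succinct bitmaps supporting constant-time rank, which is where the memory savings of a compressed suffix array come from.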
Index Terms
- Frequent Multi-Byte Character Subtring Extraction using a Succinct Data Structure