DOI: 10.1145/2960811.2967161
short-paper

Frequent Multi-Byte Character Subtring Extraction using a Succinct Data Structure

Published: 13 September 2016

ABSTRACT

Frequent string mining is widely used in text processing to extract text features. Most researchers have focused on text using single-byte characters. Consequently, their applications have problems when applied to text represented with multibyte characters, such as Japanese and Chinese text. The main drawback is huge memory usage for treating multibyte character strings. To solve this problem, we use wavelet tree-based compressed suffix arrays instead of the normal suffix array to reduce the memory usage, and a novel technique that utilizes the rank operation to improve runtime efficiency. Our experimental evaluation shows that the proposed method reduces the processing time by 45% compared with a method using only compressed suffix arrays. The proposed method also reduces the memory usage by 75%.
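
The abstract does not spell out how the rank operation is used, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It counts how often a substring occurs by backward search over a Burrows-Wheeler transform, where each step needs only two rank queries. The names `bwt`, `rank`, and `count_occurrences` are hypothetical, the suffix sorting is deliberately naive, and `rank` is a linear scan standing in for the wavelet tree that a real compressed suffix array would use. Python strings are sequences of code points, so each Japanese or Chinese character is handled as one symbol rather than as several bytes.

```python
from collections import Counter


def bwt(text: str) -> str:
    """Burrows-Wheeler transform via a naively sorted suffix array.

    Assumption: "\0" does not occur in text; it serves as a sentinel
    that sorts before every other character.
    """
    s = text + "\0"
    suffix_array = sorted(range(len(s)), key=lambda i: s[i:])
    # The BWT character for a suffix is the one that precedes it
    # (index -1 wraps around to the sentinel for the first suffix).
    return "".join(s[i - 1] for i in suffix_array)


def rank(bwt_str: str, ch: str, i: int) -> int:
    """Number of occurrences of ch in bwt_str[:i].

    Stand-in only: this scans in linear time, whereas a wavelet tree
    answers the same query in time logarithmic in the alphabet size.
    """
    return bwt_str.count(ch, 0, i)


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of a non-empty pattern in text by backward search."""
    b = bwt(text)

    # first_row[c] = number of characters in b strictly smaller than c,
    # i.e. where the block of suffixes starting with c begins.
    counts = Counter(b)
    first_row = {}
    total = 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]

    lo, hi = 0, len(b)  # half-open range of matching suffixes
    for ch in reversed(pattern):
        if ch not in first_row:
            return 0
        lo = first_row[ch] + rank(b, ch, lo)
        hi = first_row[ch] + rank(b, ch, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

For example, `count_occurrences("東京都と京都", "京都")` returns 2. A frequent-substring miner would compare such counts against a minimum-support threshold; how candidate substrings are enumerated and pruned is left open here, since the abstract does not describe it.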


Published in

DocEng '16: Proceedings of the 2016 ACM Symposium on Document Engineering
September 2016
222 pages
ISBN: 9781450344388
DOI: 10.1145/2960811

Copyright © 2016 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

DocEng '16 paper acceptance rate: 11 of 35 submissions, 31%. Overall acceptance rate: 178 of 537 submissions, 33%.